Abstract

In this paper we study the kernel multiple ridge regression framework, which we refer 
to as multi-task regression, using penalization techniques. The theoretical analysis of this 
problem shows that the key element appearing for an optimal calibration is the covariance 
matrix of the noise between the different tasks. We present a new algorithm to estimate 
this covariance matrix, based on the concept of minimal penalty, which was previously used 
in the single-task regression framework to estimate the variance of the noise. We show, 
in a non-asymptotic setting and under mild assumptions on the target function, that this
estimator converges towards the covariance matrix. Then plugging this estimator into the 
corresponding ideal penalty leads to an oracle inequality. We illustrate the behavior of our 
algorithm on synthetic examples. 
Keywords: multi-task, oracle inequality, learning theory. 

1. Introduction 

A classical paradigm in statistics is that increasing the sample size (that is, the number of 
observations) improves the performance of the estimators. However, in some cases it may 
be impossible to increase this sample size, for instance because of experimental limitations. 



*. http://www.di.ens.fr/~solnon/ 
Fortunately, in many situations practitioners can find many related and similar problems,
and might want to use those other problems as if they gave more observations for their initial
problem. The techniques using this heuristic are called "multi-task" techniques. In this
paper we study the kernel ridge regression procedure in a multi-task framework. 

One-dimensional kernel ridge regression, which we refer to as "single-task" regression,
has been widely studied. As we briefly review in Section 3, one has, given $n$ data points
$(X_i, Y_i)_{i=1}^n$, to estimate a function $f$, often the conditional expectation
$f(X_i) = \mathbb{E}[Y_i \mid X_i]$, by
minimizing the quadratic risk of the estimator regularized by a certain norm. A practically
important task is to calibrate a regularization parameter, i.e., to estimate the regularization
parameter directly from data. For kernel ridge regression (a.k.a. smoothing splines), many
methods have been proposed based on different principles, e.g., Bayesian criteria through
a Gaussian process interpretation (see, e.g., Rasmussen and Williams, 2006) or generalized
cross-validation (see, e.g., Wahba, 1990). In this paper, we focus on the concept of minimal
penalty, which was first introduced by Birgé and Massart (2007) and Arlot and Massart
(2009) for model selection, then extended to linear estimators such as kernel ridge regression
by Arlot and Bach (2011).

In this article we consider $p \ge 2$ different (but related) regression tasks, a framework
we refer to as "multi-task" regression. This setting has already been studied in different
papers. Some of those (Thrun and O'Sullivan, 1996; Caruana, 1997; Bakker and Heskes,
2003) empirically show that it can lead to performance improvement. Liang et al. (2010)
also obtained a theoretical criterion (unfortunately not observable) which tells when this
phenomenon asymptotically occurs. Several different paths have been followed to deal with
this setting. Some (see for instance Obozinski et al., 2011; Lounici et al., 2010) consider a
setting where $p \gg n$, and formulate a sparsity assumption which enables them to use the
group Lasso, assuming that all the different functions have a small set of common active
covariates. We exclude this setting from our analysis, because of the kernel nature of our
problem, and thus will not consider the similarity between the tasks in terms of sparsity,
but rather in terms of a Euclidean similarity. Another theoretical approach has also been
taken (see for example Brown and Zidek (1980), Evgeniou et al. (2005) or Ando and Zhang
(2005) on semi-supervised learning), the authors often defining a theoretical framework
where the multi-task problem can easily be expressed, and where sometimes solutions can
be computed. The main remaining theoretical problem is the calibration of a matricial
parameter (typically of size $p$), which characterizes the relationship between the tasks and
extends the regularization parameter from the single-task regression. Because of the high
dimensional nature of the problem (i.e., the small number of training observations) usual
techniques, like cross-validation, are not likely to succeed. Argyriou et al. (2008) have a
similar approach to ours, but solve this problem by adding a convex constraint to the
matrix, which will be discussed at the end of Section 5. Through a penalization technique
we show in Section 2 that the only element we have to estimate is the covariance matrix $\Sigma$
of the noise between the tasks. We give here a new algorithm to estimate $\Sigma$, and show that
the estimation is sharp enough to derive an oracle inequality, both with high probability
and in expectation. Finally we give some simulation experiment results and show that our
technique correctly deals with the multi-task setting with a low sample size.



Notations. We now introduce some notations, which will be used throughout the article. 
• The integer $n$ is the sample size, the integer $p$ is the number of tasks.

• For any $n \times p$ matrix $Y$, we define

$$ y = \operatorname{vec}(Y) := (Y_{1,1}, \dots, Y_{n,1}, Y_{1,2}, \dots, Y_{n,2}, \dots, Y_{1,p}, \dots, Y_{n,p}) \in \mathbb{R}^{np} , $$

that is, the columns $Y^j := (Y_{i,j})_{1 \le i \le n}$ are stacked.

• $\mathcal{M}_n(\mathbb{R})$ is the set of all matrices of size $n$.

• $\mathcal{S}_p(\mathbb{R})$ is the set of symmetric matrices of size $p$.

• $\mathcal{S}_p^{+}(\mathbb{R})$ is the set of symmetric positive-semidefinite matrices of size $p$.

• $\mathcal{S}_p^{++}(\mathbb{R})$ is the set of symmetric positive-definite matrices of size $p$.

• $\preceq$ denotes the partial ordering on $\mathcal{S}_p(\mathbb{R})$ defined by: $A \preceq B$ if and only if $B - A \in \mathcal{S}_p^{+}(\mathbb{R})$.

• $\mathbf{1}$ is the vector of size $p$ whose components are all equal to 1.

• $\|\cdot\|_2$ is the usual Euclidean norm on $\mathbb{R}^k$ for any $k \in \mathbb{N}$: $\forall u \in \mathbb{R}^k$, $\|u\|_2^2 := \sum_{i=1}^k u_i^2$.

2. Multi-task regression: problem set-up 

We consider p kernel ridge regression tasks. Treating them simultaneously and sharing their 
common structure (e.g., being close in some metric space) will help in reducing the overall 
prediction error. 

Let $\mathcal{X}$ be some set and $\mathcal{F}$ a set of real-valued functions over $\mathcal{X}$. We suppose $\mathcal{F}$ has a
reproducing kernel Hilbert space (RKHS) structure (Aronszajn, 1950), with kernel $k$ and
feature map $\Phi : \mathcal{X} \to \mathcal{F}$. We observe $\mathcal{D}_n = (X_i, y_i^1, \dots, y_i^p)_{i=1}^n \in (\mathcal{X} \times \mathbb{R}^p)^n$, which
gives us the positive semidefinite kernel matrix $K = (k(X_i, X_j))_{1 \le i,j \le n} \in \mathcal{S}_n^{+}(\mathbb{R})$. For each
task $j \in \{1, \dots, p\}$, $\mathcal{D}_n^j = (X_i, y_i^j)_{i=1}^n$ is a sample with distribution $\mathcal{P}_j$, for which a simple
regression problem has to be solved. In this paper we consider for simplicity that the
different tasks have the same design $(X_i)_{i=1}^n$. When the designs of the different tasks are
different the analysis is similar, but the notation would be more complicated.

We now define the model. We assume $(f^1, \dots, f^p) \in \mathcal{F}^p$, $\Sigma$ is a symmetric positive-definite
matrix of size $p$ such that the vectors $(\varepsilon_i^j)_{j=1}^p$, $i \in \{1, \dots, n\}$, are i.i.d. with normal distribution
$\mathcal{N}(0, \Sigma)$, with mean zero and covariance matrix $\Sigma$, and

$$ \forall i \in \{1, \dots, n\},\ \forall j \in \{1, \dots, p\}, \quad y_i^j = f^j(X_i) + \varepsilon_i^j . $$

This means that, while the observations are independent, the different tasks are correlated,
with covariance matrix $\Sigma$ between the tasks. We now place ourselves in the fixed-design
setting, that is, $(X_i)_{i=1}^n$ is deterministic and the goal is to estimate $(f^1(X_i), \dots, f^p(X_i))_{i=1}^n$.
Let us introduce some notation:
Let us introduce some notation: 

• $\mu_{\min} = \mu_{\min}(\Sigma)$ (resp. $\mu_{\max}$) denotes the smallest (resp. largest) eigenvalue of $\Sigma$.

• $c(\Sigma) := \mu_{\max}/\mu_{\min}$ is the condition number of $\Sigma$.
To obtain compact equations, we will use the following definition:

Definition 1. We denote by $F$ the $n \times p$ matrix $(f^j(X_i))_{1 \le i \le n,\, 1 \le j \le p}$ and introduce the
vector $f := \operatorname{vec}(F) = (f^1(X_1), \dots, f^1(X_n), \dots, f^p(X_1), \dots, f^p(X_n)) \in \mathbb{R}^{np}$, obtained by stacking the
columns of $F$. Similarly we define $Y := (y_i^j) \in \mathcal{M}_{n \times p}(\mathbb{R})$, $y := \operatorname{vec}(Y)$,
$E := (\varepsilon_i^j) \in \mathcal{M}_{n \times p}(\mathbb{R})$ and $\varepsilon := \operatorname{vec}(E)$.
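The stacking convention of Definition 1 can be illustrated with a minimal sketch. The helper name `vec` and the toy matrix below are illustrative assumptions, not part of the paper; the only point is that the columns (one per task) are concatenated, so all observations of task 1 come first.

```python
def vec(matrix):
    """Stack the columns of an n x p matrix (given as a list of rows) into a list of length n*p."""
    n = len(matrix)
    p = len(matrix[0]) if n else 0
    # Outer loop over tasks j, inner loop over observations i: column-major order.
    return [matrix[i][j] for j in range(p) for i in range(n)]


# Hypothetical responses for n = 3 observations and p = 2 tasks:
# row i holds (y_i^1, y_i^2).
Y = [[1.0, 10.0],
     [2.0, 20.0],
     [3.0, 30.0]]
y = vec(Y)
print(y)  # [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
```

With NumPy arrays the same ordering is obtained with `Y.flatten(order="F")`.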



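In order to estimate $f$, we use a regularization procedure, which extends the classical
ridge regression of the single-task setting. Let $M$ be a $p \times p$ matrix, symmetric and positive-definite.
Generalizing the work of Evgeniou et al. (2005), we estimate $f = (f^1, \dots, f^p) \in \mathcal{F}^p$
by

$$ \hat f_M \in \arg\min_{g \in \mathcal{F}^p} \left\{ \frac{1}{np} \sum_{i=1}^n \sum_{j=1}^p \left( y_i^j - g^j(X_i) \right)^2 + \sum_{j=1}^p \sum_{l=1}^p M_{j,l} \langle g^j, g^l \rangle_{\mathcal{F}} \right\} . \tag{2.1} $$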



Remark 1. Requiring that $M \succeq 0$ implies that Eq. (2.1) is a convex optimization problem,
which here, because we consider the square loss, can be solved through the resolution of a
linear system, as explained later. Moreover it allows an RKHS interpretation, which will
also be explained later.

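Example 1. The case where the $p$ tasks are treated independently can be considered in
this setting: taking $M = M_{\mathrm{ind}}(\lambda) := \operatorname{Diag}(\lambda_1, \dots, \lambda_p)$ for any $\lambda \in (0,+\infty)^p$ leads to the
criterion

$$ \frac{1}{p} \sum_{j=1}^p \left[ \frac{1}{n} \sum_{i=1}^n \left( y_i^j - g^j(X_i) \right)^2 + p \lambda_j \|g^j\|_{\mathcal{F}}^2 \right] , \tag{2.2} $$

that is, the sum of the single-task criteria described in Section 3. Hence, minimizing
Eq. (2.2) over $\lambda \in (0,+\infty)^p$ amounts to solving $p$ single-task problems independently.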



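Example 2. As done by Evgeniou et al. (2005), for every $\lambda, \mu \in (0, +\infty)$, define

$$ M_{\mathrm{similar}}(\lambda, \mu) := (\lambda + p\mu) I_p - \mu \mathbf{1}\mathbf{1}^\top = \begin{pmatrix} \lambda + (p-1)\mu & & -\mu \\ & \ddots & \\ -\mu & & \lambda + (p-1)\mu \end{pmatrix} . \tag{2.3} $$

Taking $M = M_{\mathrm{similar}}(\lambda, \mu)$ in Eq. (2.1) leads to the criterion

$$ \frac{1}{np} \sum_{i=1}^n \sum_{j=1}^p \left( y_i^j - g^j(X_i) \right)^2 + \lambda \sum_{j=1}^p \|g^j\|_{\mathcal{F}}^2 + \frac{\mu}{2} \sum_{j=1}^p \sum_{k=1}^p \|g^j - g^k\|_{\mathcal{F}}^2 . \tag{2.4} $$

Minimizing Eq. (2.4) enforces a regularization on both the norms of the functions $g^j$ and
the norms of the differences $g^j - g^k$. Thus, matrices of the form $M_{\mathrm{similar}}(\lambda, \mu)$ are useful
when the functions $g^j$ are assumed to be similar in $\mathcal{F}$. One of the main contributions of the
paper is to go beyond this case and learn from data a similarity matrix $M$ between tasks.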
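The link between the matrix $M_{\mathrm{similar}}(\lambda, \mu)$ and the regularization on norms and pairwise differences can be checked numerically. The sketch below is only an illustration under stated assumptions: the functions $g^j$ are replaced by finite-dimensional vectors (columns of a hypothetical array `G`) with the Euclidean inner product standing in for $\langle \cdot, \cdot \rangle_{\mathcal{F}}$, and all function names are made up for this example.

```python
import numpy as np


def m_similar(lam, mu, p):
    """M_similar(lam, mu) = (lam + p*mu) * I_p - mu * 1 1^T."""
    return (lam + p * mu) * np.eye(p) - mu * np.ones((p, p))


def penalty_via_matrix(M, G):
    """sum_{j,l} M[j, l] <g^j, g^l>, the functions being stored as the columns of G."""
    gram = G.T @ G                      # gram[j, l] = <g^j, g^l>
    return float(np.sum(M * gram))


def penalty_via_differences(lam, mu, G):
    """lam * sum_j ||g^j||^2 + (mu / 2) * sum_{j,k} ||g^j - g^k||^2."""
    p = G.shape[1]
    norms = sum(G[:, j] @ G[:, j] for j in range(p))
    diffs = sum(np.sum((G[:, j] - G[:, k]) ** 2)
                for j in range(p) for k in range(p))
    return float(lam * norms + 0.5 * mu * diffs)


rng = np.random.default_rng(0)
p, dim = 4, 6                           # illustrative sizes
G = rng.standard_normal((dim, p))       # column j stands in for g^j
lam, mu = 0.3, 1.7
lhs = penalty_via_matrix(m_similar(lam, mu, p), G)
rhs = penalty_via_differences(lam, mu, G)
print(np.isclose(lhs, rhs))
```

Both expressions agree because $\frac{\mu}{2}\sum_{j,k}\|g^j - g^k\|^2 = p\mu\sum_j \|g^j\|^2 - \mu\|\sum_j g^j\|^2$.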

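Example 3. We extend Example 2 to the case where the $p$ tasks consist of two groups of
close tasks. Let $I$ be a subset of $\{1, \dots, p\}$, of cardinality $1 \le k \le p - 1$. Let us denote
by $I^c$ the complement of $I$ in $\{1, \dots, p\}$, $\mathbf{1}_I$ the vector $v$ with components $v_i = \mathbb{1}_{i \in I}$, and
$\operatorname{Diag}(I)$ the diagonal matrix $d$ with components $d_{i,i} = \mathbb{1}_{i \in I}$. We then define

$$ M_I(\lambda, \mu, \nu) := \lambda I_p + \mu \operatorname{Diag}(I) + \nu \operatorname{Diag}(I^c) - \frac{\mu}{k} \mathbf{1}_I \mathbf{1}_I^\top - \frac{\nu}{p-k} \mathbf{1}_{I^c} \mathbf{1}_{I^c}^\top . \tag{2.5} $$

This matrix leads to the following criterion, which enforces a regularization on both the
norms of the functions $g^j$ and the norms of the differences $g^j - g^k$ inside the groups $I$ and
$I^c$:

$$ \frac{1}{np} \sum_{i=1}^n \sum_{j=1}^p \left( y_i^j - g^j(X_i) \right)^2 + \lambda \sum_{j=1}^p \|g^j\|_{\mathcal{F}}^2 + \frac{\mu}{2k} \sum_{j \in I} \sum_{l \in I} \|g^j - g^l\|_{\mathcal{F}}^2 + \frac{\nu}{2(p-k)} \sum_{j \in I^c} \sum_{l \in I^c} \|g^j - g^l\|_{\mathcal{F}}^2 . \tag{2.6} $$

As shown in Section 6, we can estimate the set $I$ from data (see Jacob et al. (2008) for a
more general formulation).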

Remark 2. Since $I_p$ and $\mathbf{1}\mathbf{1}^\top$ can be diagonalized simultaneously, minimizing Eq. (2.4)
and Eq. (2.6) is quite easy: it only demands optimization over two independent parameters,
which can be done with the procedure of Arlot and Bach (2011).
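To make the simultaneous diagonalization concrete for the matrices of Example 2, the following sketch checks that $(\lambda + p\mu) I_p - \mu \mathbf{1}\mathbf{1}^\top$ has only two distinct eigenvalues: $\lambda$ along the constant vector, and $\lambda + p\mu$ on its orthogonal complement. The numerical values are arbitrary choices for illustration.

```python
import numpy as np

p, lam, mu = 5, 0.4, 2.0                  # illustrative values only
ones = np.ones(p)
M = (lam + p * mu) * np.eye(p) - mu * np.outer(ones, ones)

# The constant vector spans the first eigenspace, with eigenvalue lam.
constant_ok = np.allclose(M @ ones, lam * ones)

# Any vector orthogonal to the constant vector has eigenvalue lam + p * mu.
v = np.zeros(p)
v[0], v[1] = 1.0, -1.0
contrast_ok = np.allclose(M @ v, (lam + p * mu) * v)

eigenvalues = np.linalg.eigvalsh(M)       # returned in ascending order
print(constant_ok, contrast_ok)
```

Since the eigenvectors do not depend on $(\lambda, \mu)$, the whole family is parametrized by these two eigenvalues, which is why the optimization only involves two parameters.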



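Remark 3. As stated below (Proposition 2), $M$ acts as a scalar product between the tasks.
Selecting a general matrix $M$ is thus a way to express a similarity between tasks.

Following Evgeniou et al. (2005), we define the vector space $\mathcal{G}$ of real-valued functions
over $\mathcal{X} \times \{1, \dots, p\}$ by

$$ \mathcal{G} := \left\{ g : \mathcal{X} \times \{1, \dots, p\} \to \mathbb{R} \ / \ \forall j \in \{1, \dots, p\},\ g(\cdot, j) \in \mathcal{F} \right\} . $$

We now define a bilinear symmetric form over $\mathcal{G}$,

$$ \forall g, h \in \mathcal{G}, \quad \langle g, h \rangle_{\mathcal{G}} := \sum_{j=1}^p \sum_{l=1}^p M_{j,l} \langle g(\cdot, j), h(\cdot, l) \rangle_{\mathcal{F}} , $$

which is a scalar product (see proof in Appendix A) and leads to a RKHS (see proof in
Appendix B):

Proposition 2. With the preceding notations $\langle \cdot, \cdot \rangle_{\mathcal{G}}$ is a scalar product on $\mathcal{G}$.

Corollary 1. $(\mathcal{G}, \langle \cdot, \cdot \rangle_{\mathcal{G}})$ is a RKHS.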

In order to write down the kernel matrix in compact form, we introduce the following 
notations. 

Definition 3 (Kronecker Product). Let $A \in \mathcal{M}_{m,n}(\mathbb{R})$, $B \in \mathcal{M}_{p,q}(\mathbb{R})$. We define the
Kronecker product $A \otimes B$ as being the $(mp) \times (nq)$ matrix built with $p \times q$ blocks, the block
of index $(i,j)$ being $A_{i,j} \cdot B$:

$$ A \otimes B = \begin{pmatrix} A_{1,1} B & \dots & A_{1,n} B \\ \vdots & \ddots & \vdots \\ A_{m,1} B & \dots & A_{m,n} B \end{pmatrix} . $$

The Kronecker product is a widely used tool to deal with matrices and tensor products.
Some of its classical properties are given in Section D; see also Horn and Johnson (1991).



Proposition 4. The kernel matrix associated with the design $\tilde X := (X_i, j)_{i,j} \in \mathcal{X} \times
\{1, \dots, p\}$ and the RKHS $(\mathcal{G}, \langle \cdot, \cdot \rangle_{\mathcal{G}})$ is $K_M := M^{-1} \otimes K$.

We can then apply the representer theorem (Schölkopf and Smola, 2002) to the
minimization problem (2.1) and deduce that $\hat f_M = A_M y$ with

$$ A_M = A_{M,K} := K_M (K_M + np I_{np})^{-1} = (M^{-1} \otimes K) \left( (M^{-1} \otimes K) + np I_{np} \right)^{-1} . $$
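The closed form of $A_M$ lends itself to a direct, if naive, implementation. The sketch below is a minimal illustration rather than the authors' code: the function name `smoother_matrix` and the toy kernel matrix are assumptions, and forming the full $np \times np$ Kronecker product is only reasonable for small $n$ and $p$. As a sanity check, for a diagonal $M = \operatorname{Diag}(\lambda_1, \dots, \lambda_p)$ the diagonal blocks should reduce to single-task ridge smoothers $K (K + n p \lambda_j I_n)^{-1}$.

```python
import numpy as np


def smoother_matrix(M, K):
    """A_M = K_M (K_M + n p I)^{-1} with K_M = M^{-1} kron K."""
    n, p = K.shape[0], M.shape[0]
    K_M = np.kron(np.linalg.inv(M), K)
    # K_M commutes with (K_M + n p I)^{-1}, so solving the linear system
    # (K_M + n p I) X = K_M returns A_M without forming an explicit inverse.
    return np.linalg.solve(K_M + n * p * np.eye(n * p), K_M)


rng = np.random.default_rng(1)
n, p = 5, 3                                   # illustrative sizes
X = rng.standard_normal((n, 2))
K = X @ X.T + np.eye(n)                       # a toy positive-definite kernel matrix
lambdas = np.array([0.1, 0.5, 2.0])
A = smoother_matrix(np.diag(lambdas), K)

# Diagonal block j should be the single-task smoother with parameter p * lambda_j.
first_block = A[:n, :n]
single_task = K @ np.linalg.inv(K + n * p * lambdas[0] * np.eye(n))
print(np.allclose(first_block, single_task))
```

The ordering of the blocks matches the column stacking of Definition 1: block $(j, l)$ maps the observations of task $l$ to the fitted values of task $j$.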

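Now when working in multi-task regression, a set $\mathcal{M} \subset \mathcal{S}_p^{++}(\mathbb{R})$ of matrices $M$ is
given, and the goal is to select the "best" one, that is, minimizing over $M$ the quadratic
risk $n^{-1} \|\hat f_M - f\|_2^2$. For instance, the single-task framework corresponds to $p = 1$ and
$\mathcal{M} = (0, +\infty)$. The multi-task case is far richer. The ideal choice, called the oracle, is

$$ M^\star \in \arg\min_{M \in \mathcal{M}} \left\{ \|\hat f_M - f\|_2^2 \right\} . $$

However $M^\star$ is not an estimator, since it depends on $f$. As explained by Arlot and Bach
(2011), we choose $\hat M$ as a minimizer over $\mathcal{M}$ of

$$ \operatorname{crit}(M) = \frac{1}{np} \|y - \hat f_M\|_2^2 + \operatorname{pen}(M) , $$

where the penalty term $\operatorname{pen}(M)$ has to be chosen appropriately. The unbiased risk
estimation principle (introduced by Akaike, 1970) requires

$$ \mathbb{E}[\operatorname{crit}(M)] \approx \mathbb{E}\left[ \frac{1}{np} \|\hat f_M - f\|_2^2 \right] , $$

which leads to the (deterministic) ideal penalty

$$ \operatorname{pen}_{\mathrm{id}}(M) := \mathbb{E}\left[ \frac{1}{np} \|\hat f_M - f\|_2^2 \right] - \mathbb{E}\left[ \frac{1}{np} \|y - \hat f_M\|_2^2 \right] . $$

Since $\hat f_M = A_M y$ and $y = f + \varepsilon$, we can write

$$ \|\hat f_M - y\|_2^2 = \|\hat f_M - f\|_2^2 + \|\varepsilon\|_2^2 - 2 \langle \varepsilon, A_M \varepsilon \rangle + 2 \langle \varepsilon, (I_{np} - A_M) f \rangle . $$

Since $\varepsilon$ is centered and $M$ is deterministic, we get, up to an additive factor independent
of $M$,

$$ \operatorname{pen}_{\mathrm{id}}(M) = \frac{2\, \mathbb{E}[\langle \varepsilon, A_M \varepsilon \rangle]}{np} , $$

that is, as the covariance matrix of $\varepsilon$ is $\Sigma \otimes I_n$,

$$ \operatorname{pen}_{\mathrm{id}}(M) = \frac{2 \operatorname{tr}\left( A_M \cdot (\Sigma \otimes I_n) \right)}{np} . \tag{2.7} $$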
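The step from $\mathbb{E}[\langle \varepsilon, A_M \varepsilon \rangle]$ to the trace formula rests on the covariance of the stacked noise vector being $\Sigma \otimes I_n$. The following sketch checks this structure deterministically and evaluates Eq. (2.7); the function name `ideal_penalty` and all numerical inputs are illustrative assumptions. It uses the fact that if the rows of $G$ are i.i.d. $\mathcal{N}(0, I_p)$ and $\Sigma = L L^\top$, then the rows of $E = G L^\top$ are i.i.d. $\mathcal{N}(0, \Sigma)$ and $\operatorname{vec}(E) = (L \otimes I_n) \operatorname{vec}(G)$.

```python
import numpy as np


def ideal_penalty(A_M, Sigma, n):
    """pen_id(M) = 2 tr(A_M (Sigma kron I_n)) / (n p), as in Eq. (2.7)."""
    p = Sigma.shape[0]
    return 2.0 * np.trace(A_M @ np.kron(Sigma, np.eye(n))) / (n * p)


rng = np.random.default_rng(2)
n, p = 4, 3                                # illustrative sizes
B = rng.standard_normal((p, p))
Sigma = B @ B.T + np.eye(p)                # an arbitrary positive-definite covariance
L = np.linalg.cholesky(Sigma)              # Sigma = L L^T

G = rng.standard_normal((n, p))            # rows: i.i.d. N(0, I_p) draws
E = G @ L.T                                # rows: i.i.d. N(0, Sigma) draws
lift = np.kron(L, np.eye(n))               # vec(E) = (L kron I_n) vec(G)

same_vector = np.allclose(lift @ G.flatten(order="F"), E.flatten(order="F"))
# Cov(vec(E)) = lift @ Cov(vec(G)) @ lift.T = lift @ lift.T = Sigma kron I_n.
same_covariance = np.allclose(lift @ lift.T, np.kron(Sigma, np.eye(n)))
print(same_vector, same_covariance)
```

For instance, with $A_M = I_{np}$ the penalty equals $2 \operatorname{tr}(\Sigma)/p$, which is a convenient sanity check for any implementation.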



In order to approach this penalty as precisely as possible, we have to sharply estimate $\Sigma$.
In the single-task case, such a problem reduces to estimating the variance $\sigma^2$ of the noise
and was tackled by Arlot and Bach (2011). Since our approach for estimating $\Sigma$ heavily
relies on these results, they are summarized in the next section.
3. Single task framework: estimating a single variance

This section recalls some of the main results from Arlot and Bach (2011) and can be
considered as a special case of Section 2, with $p = 1$, $\Sigma = \sigma^2 > 0$ and $\mathcal{M} = [0, +\infty]$. Writing
$M = \lambda$ with $\lambda \in [0, +\infty]$, the regularization matrix is

$$ \forall \lambda \in (0, +\infty), \quad A_\lambda = A_{\lambda, K} = K (K + n \lambda I_n)^{-1} , $$

$A_0 = I_n$ and $A_{+\infty} = 0$; the ideal penalty becomes

$$ \operatorname{pen}_{\mathrm{id}}(\lambda) = \frac{2 \sigma^2 \operatorname{tr}(A_\lambda)}{n} . \tag{3.1} $$

By analogy with the case where $A_\lambda$ is an orthogonal projection matrix, $\operatorname{df}(\lambda) := \operatorname{tr}(A_\lambda)$ is
called the effective degree of freedom, first introduced by Hastie and Tibshirani (1990) and
generalized by Zhang (2005). The ideal penalty however depends on $\sigma^2$; in order to have
a fully data-driven penalty we have to replace $\sigma^2$ by an estimator $\hat\sigma^2$ inside $\operatorname{pen}_{\mathrm{id}}(\lambda)$. For
every $\lambda \in [0, +\infty]$, define

$$ \operatorname{pen}_{\min}(\lambda) = \operatorname{pen}_{\min}(\lambda, K) := \frac{2 \operatorname{tr}(A_{\lambda,K}) - \operatorname{tr}(A_{\lambda,K}^\top A_{\lambda,K})}{n} . $$

Theoretical arguments show that when a penalty proportional to $\operatorname{pen}_{\min}(\lambda)$ is chosen, then
if the proportionality coefficient is smaller than $\sigma^2/n$ the procedure overfits, while when
this coefficient is greater than $\sigma^2/n$ the procedure leads to good estimation properties and a
low effective degree of freedom. The following algorithm was introduced in Arlot and Bach
(2011) and uses this fact to estimate $\sigma^2$.



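Algorithm 1. Input: $Y \in \mathbb{R}^n$, $K \in \mathcal{S}_n^{++}(\mathbb{R})$.

1. For every $C > 0$, compute

$$ \hat\lambda_0(C) \in \arg\min_{\lambda \in [0, +\infty]} \left\{ \frac{1}{n} \|A_{\lambda,K} Y - Y\|_2^2 + C \operatorname{pen}_{\min}(\lambda, K) \right\} . $$

2. Output: $\hat C$ such that $\operatorname{df}(\hat\lambda_0(\hat C)) \in [n/10, n/3]$.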

An efficient algorithm for the first step of Algorithm 1 is detailed in Arlot and Massart
(2009), and we discuss the way we implemented Algorithm 1 in Section 6. The output $\hat C$
of Algorithm 1 is a provably consistent estimator of $\sigma^2$, as stated in the following theorem.



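Theorem 5 (Corollary of Theorem 1 of Arlot and Bach (2011)). Let $\beta = 150$ and $\alpha = 2$.
Suppose $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$ with $\sigma^2 > 0$, and that $\lambda_0 \in (0, +\infty)$ and $d_n \ge 1$ exist such that

$$ \operatorname{df}(\lambda_0) \le \sqrt{n} \quad \text{and} \quad \frac{1}{n} \|(A_{\lambda_0} - I_n) F\|_2^2 \le d_n \sigma^2 \sqrt{\frac{\ln n}{n}} . \tag{3.2} $$

Then for every $\delta \ge 2$, some constants $n_0(\delta)$, $\kappa > 0$ and an event $\Omega$ exist such that $\mathbb{P}(\Omega) \ge
1 - \kappa n^{-\delta}$ and for $n \ge n_0(\delta)$, on $\Omega$,

$$ \left( 1 - \beta(\alpha + \delta) \sqrt{\frac{\ln n}{n}} \right) \sigma^2 \le \hat C \le \left( 1 + \beta(\alpha + \delta) d_n \sqrt{\frac{\ln(n)}{n}} \right) \sigma^2 . \tag{3.3} $$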



Remark 4. The values $n/10$ and $n/3$ have no particular meaning and can be replaced by
$n/k$, $n/k'$, with $k > k' > 2$. Only $\beta$ depends on $k$ and $k'$.
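The minimal-penalty procedure of Algorithm 1 can be sketched on finite grids as follows. This is a simplified illustration under several assumptions, not the implementation referred to in the text: the function name `estimate_noise_variance`, the grids `lambdas` and `Cs`, and the toy data are hypothetical; the endpoints $\lambda \in \{0, +\infty\}$ are left out of the grid; and the output rule (the smallest $C$ on the grid for which the selected degree of freedom falls below $n/2$) is one possible way of locating the jump, in the spirit of the remark above that the thresholds are not essential.

```python
import numpy as np


def estimate_noise_variance(Y, K, lambdas, Cs):
    """Grid version of the minimal-penalty estimator of the noise variance.

    Y: responses (length n); K: kernel matrix (n x n);
    lambdas: positive regularization grid; Cs: increasing grid of candidate constants.
    """
    n = K.shape[0]
    eigvals, U = np.linalg.eigh(K)
    eigvals = np.clip(eigvals, 0.0, None)          # guard against tiny negative values
    Yt = U.T @ Y                                   # coordinates of Y in the eigenbasis of K
    # Row l holds the eigenvalues of A_lambda for lambda = lambdas[l].
    shrink = eigvals[None, :] / (eigvals[None, :] + n * lambdas[:, None])
    df = shrink.sum(axis=1)                                     # tr(A_lambda)
    pen_min = (2.0 * shrink - shrink ** 2).sum(axis=1) / n      # (2 tr A - tr A^T A) / n
    residual = (((1.0 - shrink) * Yt[None, :]) ** 2).sum(axis=1) / n
    for C in Cs:
        best = np.argmin(residual + C * pen_min)   # index of lambda_0(C) on the grid
        if df[best] < n / 2:                       # the degree of freedom has jumped down
            return C
    return Cs[-1]


rng = np.random.default_rng(3)
n = 60
x = np.sort(rng.uniform(-1.0, 1.0, n))
K = np.exp(-np.abs(x[:, None] - x[None, :]))       # toy kernel matrix
Y = np.sin(3.0 * x) + 0.5 * rng.standard_normal(n)
lambdas = np.logspace(-6, 2, 50)
Cs = np.logspace(-3, 1, 60)
C_hat = estimate_noise_variance(Y, K, lambdas, Cs)
print(C_hat)
```

Working in the eigenbasis of $K$ means that every quantity is a sum over eigenvalues, so the cost after one eigendecomposition is linear in the number of grid points.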
4. Estimation of the noise covariance matrix $\Sigma$

Thanks to the results developed by Arlot and Bach (2011) (recapitulated in Section 3), we
know how to estimate a variance for any one-dimensional problem. In order to estimate $\Sigma$,
which has $p(p+1)/2$ parameters, we can use several one-dimensional problems. Projecting
$Y$ onto some direction $z \in \mathbb{R}^p$ yields

$$ Y_z := Y \cdot z = F \cdot z + E \cdot z = F_z + \varepsilon_z , \tag{$P_z$} $$

with $\varepsilon_z \sim \mathcal{N}(0, \sigma_z^2 I_n)$ and $\sigma_z^2 := \operatorname{Var}[\varepsilon \cdot z] = z^\top \Sigma z$. Therefore, we will estimate $\sigma_z^2$ for $z \in \mathcal{Z}$
a well chosen set, and use these estimators to build back an estimation of $\Sigma$.
We now explain how to estimate $\Sigma$ using those one-dimensional projections.

Definition 6. Let $a(z)$ be the output $\hat C$ of Algorithm 1 applied to problem ($P_z$), that is,
with input $Y_z \in \mathbb{R}^n$ and $K \in \mathcal{S}_n^{++}(\mathbb{R})$.



The idea is to apply Algorithm 1 to the elements $z$ of a carefully chosen set $\mathcal{Z}$. Noting
$e_i$ the $i$-th vector of the canonical basis of $\mathbb{R}^p$, we introduce $\mathcal{Z} = \{e_i,\ i \in \{1, \dots, p\}\} \cup
\{e_i + e_j,\ 1 \le i < j \le p\}$. We can see that $a(e_i)$ estimates $\Sigma_{i,i}$, while $a(e_i + e_j)$ estimates
$\Sigma_{i,i} + \Sigma_{j,j} + 2 \Sigma_{i,j}$. Hence, $\Sigma_{i,j}$ can be estimated by $(a(e_i + e_j) - a(e_i) - a(e_j))/2$. This
leads to the definition of the following map $J$, which builds a symmetric matrix using the
latter construction.

Definition 7. Let $J : \mathbb{R}^{p(p+1)/2} \to \mathcal{S}_p(\mathbb{R})$ be defined by

$$ J(a_1, \dots, a_p, a_{1,2}, \dots, a_{1,p}, \dots, a_{p-1,p})_{i,i} = a_i \quad \text{if } 1 \le i \le p , $$

$$ J(a_1, \dots, a_p, a_{1,2}, \dots, a_{1,p}, \dots, a_{p-1,p})_{i,j} = \frac{a_{i,j} - a_i - a_j}{2} \quad \text{if } 1 \le i < j \le p . $$

This map is bijective, and for all $B \in \mathcal{S}_p(\mathbb{R})$,

$$ J^{-1}(B) = (B_{1,1}, \dots, B_{p,p}, B_{1,1} + B_{2,2} + 2 B_{1,2}, \dots, B_{p-1,p-1} + B_{p,p} + 2 B_{p-1,p}) . $$

This leads us to defining the following estimator of $\Sigma$:

$$ \hat\Sigma := J\left( a(e_1), \dots, a(e_p), a(e_1 + e_2), \dots, a(e_1 + e_p), \dots, a(e_{p-1} + e_p) \right) . \tag{4.1} $$
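The reconstruction map $J$ and its inverse are simple enough to be written out explicitly. In the sketch below the function names, the dictionary layout for the pairwise terms, and the 0-based indices are choices made for this illustration only. Feeding the exact projected variances $z^\top \Sigma z$ of a known matrix through $J$ must return that matrix, which is what the example checks; in the actual procedure these variances would be replaced by the outputs $a(z)$ of Algorithm 1.

```python
def build_covariance(diag, pairs):
    """Map J: diag[i] plays the role of a(e_i), pairs[(i, j)] that of a(e_i + e_j), i < j."""
    p = len(diag)
    S = [[0.0] * p for _ in range(p)]
    for i in range(p):
        S[i][i] = diag[i]
    for (i, j), a_ij in pairs.items():
        # Var(e_i + e_j) = S_ii + S_jj + 2 S_ij, solved for S_ij.
        S[i][j] = S[j][i] = (a_ij - diag[i] - diag[j]) / 2.0
    return S


def projected_variances(S):
    """Inverse map J^{-1}: the variances z^T S z for z = e_i and z = e_i + e_j."""
    p = len(S)
    diag = [S[i][i] for i in range(p)]
    pairs = {(i, j): S[i][i] + S[j][j] + 2.0 * S[i][j]
             for i in range(p) for j in range(i + 1, p)}
    return diag, pairs


# A hypothetical 3 x 3 covariance matrix.
Sigma = [[2.0, 0.5, -0.25],
         [0.5, 1.0, 0.75],
         [-0.25, 0.75, 3.0]]
diag, pairs = projected_variances(Sigma)
Sigma_rebuilt = build_covariance(diag, pairs)
print(Sigma_rebuilt == Sigma)  # True
```

Note that nothing in $J$ forces the output to be positive semidefinite when its inputs are noisy estimates, which is one reason the multiplicative control of the error matters.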



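Let us recall that $\forall \lambda \in (0, +\infty)$, $A_\lambda = A_{\lambda,K} = K (K + n \lambda I_n)^{-1}$. Following Arlot and Bach
(2011) we make the following assumption from now on:

$$ \forall j \in \{1, \dots, p\},\ \exists \lambda_{0,j} \in (0, +\infty), \quad \operatorname{df}(\lambda_{0,j}) \le \sqrt{n} \quad \text{and} \quad \frac{1}{n} \|(A_{\lambda_{0,j}} - I_n) F e_j\|_2^2 \le \Sigma_{j,j} \sqrt{\frac{\ln n}{n}} . \tag{Hdf} $$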

We can now state the main result of the paper. 

Theorem 8. Let $\hat\Sigma$ be defined by Eq. (4.1), $\alpha = 2$, $\kappa > 0$ be the numerical constant defined
in Theorem 5 and assume (Hdf) holds. For every $\delta \ge 2$, a constant $n_0(\delta)$, an absolute
constant $L_1 > 0$ and an event $\Omega$ exist such that $\mathbb{P}(\Omega) \ge 1 - \kappa p^2 n^{-\delta}$ and for every $n \ge n_0(\delta)$,
on $\Omega$,

$$ (1 - \eta) \Sigma \preceq \hat\Sigma \preceq (1 + \eta) \Sigma , \tag{4.2} $$

$$ \text{where} \quad \eta := L_1 (\alpha + \delta)\, p\, \sqrt{\frac{\ln(n)}{n}}\, c(\Sigma)^2 . $$

Theorem 8 is proved in Section D. It shows $\hat\Sigma$ estimates $\Sigma$ with a "multiplicative" error
controlled with large probability, in a non-asymptotic setting. The multiplicative nature of
the error is crucial for deriving the oracle inequality stated in Section 5, since it allows us to
show the ideal penalty defined in Eq. (2.7) is precisely estimated when $\Sigma$ is replaced by $\hat\Sigma$.

An important feature of Theorem 8 is that it holds under very mild assumptions on
the mean $f$ of the data (see Remark 5). Therefore, it shows $\hat\Sigma$ is able to estimate a
covariance matrix without prior knowledge on the regression function, which, to the best of
our knowledge, has never been obtained in multi-task regression.
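A multiplicative bound of the form $(1 - \eta) \Sigma \preceq \hat\Sigma \preceq (1 + \eta) \Sigma$ can be checked numerically when $\Sigma$ is known, as in a simulation. The sketch below computes the smallest admissible $\eta$; the function name `multiplicative_error` and the toy matrices are assumptions made for illustration. It relies on the fact that, with $\Sigma = L L^\top$, the ordering is equivalent to all eigenvalues of $L^{-1} \hat\Sigma L^{-\top}$ lying in $[1 - \eta, 1 + \eta]$.

```python
import numpy as np


def multiplicative_error(Sigma_hat, Sigma):
    """Smallest eta such that (1 - eta) Sigma <= Sigma_hat <= (1 + eta) Sigma (Loewner order)."""
    L = np.linalg.cholesky(Sigma)                  # Sigma = L L^T
    L_inv = np.linalg.inv(L)
    whitened = L_inv @ Sigma_hat @ L_inv.T         # same ordering, with Sigma replaced by I_p
    eigvals = np.linalg.eigvalsh((whitened + whitened.T) / 2.0)
    return float(np.max(np.abs(eigvals - 1.0)))


Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                     # hypothetical true covariance
eta_scaled = multiplicative_error(1.1 * Sigma, Sigma)
eta_exact = multiplicative_error(Sigma, Sigma)
print(round(eta_scaled, 6), round(eta_exact, 6))
```

A uniformly inflated estimate $1.1\,\Sigma$ has error $0.1$ whatever the conditioning of $\Sigma$, whereas an additive perturbation of the same size could be much worse in the directions of small eigenvalues.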

Remark 5 (On assumption (Hdf)). Assumption (Hdf) is a single-task assumption (made
independently for each task). The upper bound $\sqrt{\ln(n)/n}$ can be multiplied by any factor
$1 \le d_n \ll \sqrt{n/\ln(n)}$ (as in Theorem 5), at the price of multiplying $\eta$ by $d_n$ in the upper
bound of Eq. (4.2).

Assumption (Hdf) is rather classical in model selection, see Arlot and Bach (2011) for
instance. In particular, (a weakened version of) (Hdf) holds if the bias $n^{-1} \|(A_\lambda - I_n) F e_i\|_2^2$
is bounded by $C_1 \operatorname{tr}(A_\lambda)^{-C_2}$, for some $C_1, C_2 > 0$.

Remark 6 (Scaling of $(n,p)$ for consistency). A sufficient condition for ensuring $\hat\Sigma$ is a
consistent estimator of $\Sigma$ is

$$ p\, c(\Sigma)^2 \sqrt{\frac{\ln(n)}{n}} \to 0 , $$

which enforces a scaling between $n$, $p$ and $c(\Sigma)$. Nevertheless, this condition is probably not
necessary since the simulation experiments of Section 6 show that $\Sigma$ can be well estimated
(at least for estimator selection purposes) in a setting where $\eta \gg 1$.

Remark 7 (Choice of the set $\mathcal{Z}$). Other choices could have been made for $\mathcal{Z}$, however ours
seems easier in terms of computation, since $|\mathcal{Z}| = p(p+1)/2$. Choosing a larger set $\mathcal{Z}$ leads
to theoretical difficulties in the reconstruction of $\Sigma$, while taking other basis vectors leads to
more complex computations. We can also note that increasing $|\mathcal{Z}|$ decreases the probability
in Theorem 8, since it comes from a union bound over the one-dimensional estimations.

5. Oracle inequality 

We now show that the estimator introduced in Eq. (4.1) is precise enough to derive an
oracle inequality when plugged in the penalty defined in Eq. (2.7).

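Definition 9. Let $\hat\Sigma$ be the estimator of $\Sigma$ defined by Eq. (4.1). We define

$$ \hat M \in \arg\min_{M \in \mathcal{M}} \left\{ \|\hat f_M - y\|_2^2 + 2 \operatorname{tr}\left( A_M \cdot (\hat\Sigma \otimes I_n) \right) \right\} . \tag{5.1} $$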
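The plug-in selection rule of Definition 9 can be sketched by brute force over a finite grid of candidate matrices. The example below restricts the search to matrices of the form $(\lambda + p\mu) I_p - \mu \mathbf{1}\mathbf{1}^\top$; the function name `select_similarity`, the grids, the toy data and the stand-in value used for $\hat\Sigma$ are all assumptions made for illustration, and forming the $np \times np$ matrices explicitly is only feasible for small problems.

```python
import itertools

import numpy as np


def select_similarity(Y, K, Sigma_hat, lambdas, mus):
    """Return (criterion, lam, mu) minimizing ||A_M y - y||^2 + 2 tr(A_M (Sigma_hat kron I_n))."""
    n, p = Y.shape
    y = Y.flatten(order="F")                       # vec(Y): columns stacked
    noise_cov = np.kron(Sigma_hat, np.eye(n))
    best = None
    for lam, mu in itertools.product(lambdas, mus):
        M = (lam + p * mu) * np.eye(p) - mu * np.ones((p, p))
        K_M = np.kron(np.linalg.inv(M), K)
        A_M = np.linalg.solve(K_M + n * p * np.eye(n * p), K_M)
        crit = np.sum((A_M @ y - y) ** 2) + 2.0 * np.trace(A_M @ noise_cov)
        if best is None or crit < best[0]:
            best = (float(crit), lam, mu)
    return best


rng = np.random.default_rng(4)
n, p = 8, 2                                        # deliberately tiny sizes
x = rng.standard_normal(n)
K = np.exp(-np.abs(x[:, None] - x[None, :]))       # toy kernel matrix
Y = np.column_stack([np.sin(x), np.sin(x)]) + 0.3 * rng.standard_normal((n, p))
Sigma_hat = 0.09 * np.eye(p)                       # stand-in for the estimator of Eq. (4.1)
lambda_grid = [0.01, 0.1, 1.0]
mu_grid = [0.01, 0.1, 1.0]
crit, lam, mu = select_similarity(Y, K, Sigma_hat, lambda_grid, mu_grid)
print(lam, mu)
```

When all candidate matrices share an eigenbasis, the same criterion can be evaluated far more cheaply through their eigenvalues, without forming any Kronecker product.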
We make the following assumption, which means that the matrices of $\mathcal{M}$ are jointly
diagonalizable:

$$ \exists P \in O_p(\mathbb{R}), \quad \mathcal{M} \subseteq \left\{ P^\top \operatorname{Diag}(d_1, \dots, d_p) P,\ (d_i)_{i=1}^p \in (0, +\infty)^p \right\} . \tag{HM} $$



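Theorem 10. Let $\alpha = 2$, $\delta \ge 2$ and assume (Hdf) and (HM) hold true. Absolute constants
$L_2 > 0$ and $\kappa'$, a constant $n_1(\delta)$ and an event $\widetilde\Omega$ exist such that $\mathbb{P}(\widetilde\Omega) \ge 1 - \kappa' p^2 n^{-\delta}$ and the
following holds as soon as $n \ge n_1(\delta)$. First, on $\widetilde\Omega$,

$$ \frac{1}{np} \|\hat f_{\hat M} - f\|_2^2 \le \left( 1 + \frac{1}{\ln(n)} \right)^2 \inf_{M \in \mathcal{M}} \left\{ \frac{1}{np} \|\hat f_M - f\|_2^2 \right\} + L_2\, c(\Sigma)^4 \operatorname{tr}(\Sigma) (\alpha + \delta)^2 \frac{p^3 \ln(n)^3}{np} . \tag{5.2} $$

Second,

$$ \mathbb{E}\left[ \frac{1}{np} \|\hat f_{\hat M} - f\|_2^2 \right] \le \left( 1 + \frac{1}{\ln(n)} \right)^2 \mathbb{E}\left[ \inf_{M \in \mathcal{M}} \left\{ \frac{1}{np} \|\hat f_M - f\|_2^2 \right\} \right] + L_2\, c(\Sigma)^4 \operatorname{tr}(\Sigma) (\alpha + \delta)^2 \frac{p^3 \ln(n)^3}{np} + \frac{p}{n^{\delta/2}} \frac{\|f\|_2^2}{np} . \tag{5.3} $$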

Theorem 10 is proved in Section E. Taking $p = 1$ (hence $c(\Sigma) = 1$ and $\operatorname{tr}(\Sigma) = \sigma^2$), we
recover Theorem 3 of Arlot and Bach (2011) as a corollary.



Remark 8. Our result is a non-asymptotic oracle inequality, with a multiplicative term of
the form $1 + o(1)$. This allows us to claim that our selection procedure is nearly optimal,
since our estimator is close (with regard to the empirical quadratic norm) to the oracle one.
Furthermore the term $1 + (\ln(n))^{-1}$ in front of the infima in Eq. (5.2) and (5.3) can be
further diminished, but this yields a larger remainder term as a consequence.



Remark 9 (On assumption (HM)). Assumption (HM) actually means all matrices in
$\mathcal{M}$ can be diagonalized in a unique orthogonal basis, and thus can be parametrized by their
eigenvalues. In that case the optimization problem is quite easy to solve. If not, solving
(5.1) may turn out to be a hard problem, and our theoretical results do not cover this
setting. However, it is always possible to discretize the set $\mathcal{M}$ as in Arlot and Bach (2011) or,
in practice, to use gradient descent; we conjecture Theorem 10 still holds without (HM)
as long as $\mathcal{M}$ is not "too large", which could be proved similarly up to some uniform
concentration inequalities.

Note also that if $\mathcal{M}_1, \dots, \mathcal{M}_K$ all satisfy (HM) (with different matrices $P$), then
Theorem 10 still holds for $\mathcal{M} = \bigcup_{k=1}^K \mathcal{M}_k$ with $\mathbb{P}(\widetilde\Omega) \ge 1 - 9 K p^2 n^{-\delta}$, by applying the union
bound in the proof.

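Remark 10 (Scaling of $(n,p)$). Eq. (5.2) implies the asymptotic optimality of the estimator
$\hat f_{\hat M}$ when

$$ c(\Sigma)^4 \frac{\operatorname{tr} \Sigma}{p} \times \frac{p^3 (\ln(n))^3}{n} \ll \inf_{M \in \mathcal{M}} \left\{ \frac{1}{np} \|\hat f_M - f\|_2^2 \right\} . $$

In particular, only $(n,p)$ such that $p^3 \ll n/(\ln(n))^3$ are admissible.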
Remark 11 (Relationship with the trace norm). Our framework relies on the minimization
of Eq. (2.1) with respect to $f$. Argyriou et al. (2008) have shown that if we also minimize with
respect to the matrix $M$ subject to the constraint $\operatorname{tr} M^{-1} = 1$, then we obtain an equivalent
regularization by the nuclear norm (a.k.a. trace norm), which implies the prior knowledge
that our $p$ prediction functions may be obtained as the linear combination of $r \ll p$ basis
functions. This situation corresponds to cases where the matrix $M^{-1}$ is singular, and we
allow this explicitly in our experiments.

Note that the link between our framework and trace norm (i.e., nuclear norm)
regularization is the same as between multiple kernel learning and the single-task framework of
Arlot and Bach (2011). In the multi-task case, the trace-norm regularization, though
efficient computationally, does not lead to an oracle inequality, while our criterion is an unbiased
estimate of the generalization error, which turns out to be non-convex in the matrix $M$.
While DC programming techniques (see, e.g., Gasso et al. (2009), and references therein)
could be brought to bear to find local optima, the goal of the present work is to study the
theoretical properties of our estimators, assuming we can minimize the cost function (e.g.,
in special cases, where we consider spectral variants, or by brute force enumeration).

6. Simulation experiments 

In all the experiments presented in this section, we consider the framework of Section 2
with $\mathcal{X} = \mathbb{R}^d$, $d = 4$, and the kernel defined by $\forall x, y \in \mathcal{X}$, $k(x,y) = \prod_{j=1}^d e^{-|x_j - y_j|}$.
The design points $X_1, \dots, X_n \in \mathbb{R}^d$ are drawn (repeatedly and independently for each
sample) independently from the multivariate standard Gaussian distribution. For every
$j \in \{1, \dots, p\}$, $f^j(\cdot) = \sum_{i=1}^m \alpha_i^j k(\cdot, z_i)$ where $m = 4$ and $z_1, \dots, z_m \in \mathbb{R}^d$ are drawn (once for
all) independently from the multivariate standard Gaussian distribution, independent from
the design $(X_i)_{1 \le i \le n}$. Thus, the expectations that will be considered are taken conditionally
to the $z_i$. The coefficients $(\alpha_i^j)_{1 \le i \le m,\, 1 \le j \le p}$ differ according to the setting.
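The simulation design can be sketched as follows, assuming the kernel is the product of one-dimensional Laplace kernels, $k(x,y) = \prod_j e^{-|x_j - y_j|} = e^{-\|x - y\|_1}$, and taking all coefficients $\alpha_i^j$ equal to 1 as an example. The function name `laplace_product_kernel`, the seed and the number of tasks are illustrative assumptions; the noise is not generated here.

```python
import numpy as np


def laplace_product_kernel(A, B):
    """k(x, y) = prod_j exp(-|x_j - y_j|) = exp(-||x - y||_1) for all rows x of A and y of B."""
    l1 = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)
    return np.exp(-l1)


rng = np.random.default_rng(0)
n, d, m, p = 100, 4, 4, 5
X = rng.standard_normal((n, d))            # design points, redrawn for each sample
Z = rng.standard_normal((m, d))            # centers z_1, ..., z_m, drawn once
K = laplace_product_kernel(X, X)           # n x n kernel matrix

alpha = np.ones((m, p))                    # alpha[i, j] = 1: all tasks share one function
F = laplace_product_kernel(X, Z) @ alpha   # F[i, j] = f^j(X_i) = sum_i' alpha[i', j] k(X_i, z_i')
print(K.shape, F.shape)
```

With this choice of coefficients all columns of `F` coincide, which is the situation where sharing information between tasks should help most.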

Settings. Four experimental settings are considered: 

A. Various numbers of tasks: $n = 100$ and $\forall i, j$, $\alpha_i^j = 1$, that is, $\forall j$, $f^j = f_A :=
\sum_{i=1}^m k(\cdot, z_i)$. The number of tasks is varying: $p \in \{2k \ / \ k = 1, \dots, 25\}$. The
covariance matrix is $\Sigma = \Sigma_{A,p}$ defined as the first $p \times p$ block of $\Sigma_{A,50}$, where $\Sigma_{A,50}$ has been
drawn (once for all) from the Wishart $W(I_{50}, 50, 100)$ distribution. The condition
number of $\Sigma_{A,p}$ increases from $c(\Sigma_{A,2}) \approx 1.17$ to $c(\Sigma_{A,50}) \approx 22.50$ as $p$ increases.

B. Various sample sizes: $p = 5$, $\forall j$, $f^j = f_A$ and $\Sigma = \Sigma_B$ has been drawn (once
for all) from the Wishart $W(I_5, 10, 5)$ distribution; the condition number of $\Sigma_B$ is
$c(\Sigma_B) \approx 22.05$. The only varying parameter is $n \in \{50k \ / \ k = 1, \dots, 20\}$.

C. Various noise levels: $n = 100$, $p = 5$ and $\forall j$, $f^j = f_A$. The varying parameter is
$\Sigma = \Sigma_{C,t} := t \Sigma_B$ with $t \in \{0.5k \ / \ k = 1, \dots, 20\}$.

D. Clustering of two groups of functions: $p = 10$, $n = 100$, $\Sigma = \Sigma_E$ has been
drawn (once for all) from the Wishart $W(I_{10}, 20, 10)$ distribution; the condition
number of $\Sigma_E$ is $c(\Sigma_E) \approx 24.95$. We pick the function $f := \sum_{i=1}^m \alpha_i k(\cdot, z_i)$ by drawing
$(\alpha_1, \dots, \alpha_m)$ from standard multivariate normal distributions and finally $f^1 = \dots =
f^5 = f$, $f^6 = \dots = f^{10} = -f$.

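Collections of matrices. Two different sets of matrices $\mathcal{M}$ are considered in
Experiments A-C, following Examples 1 and 2:

$$ \mathcal{M}_{\mathrm{similar}} := \left\{ M_{\mathrm{similar}}(\lambda, \mu) = (\lambda + p\mu) I_p - \mu \mathbf{1}\mathbf{1}^\top \ / \ (\lambda, \mu) \in (0, +\infty)^2 \right\} $$

$$ \text{and} \quad \mathcal{M}_{\mathrm{ind}} := \left\{ M_{\mathrm{ind}}(\lambda) = \operatorname{Diag}(\lambda_1, \dots, \lambda_p) \ / \ \lambda \in (0, +\infty)^p \right\} . $$

In Experiment D, we also use two different sets of matrices, following Example 3:

$$ \mathcal{M}_{\mathrm{clus}} := \bigcup_{I \subset \{1, \dots, p\},\ I \notin \{\{1, \dots, p\}, \emptyset\}} \left\{ M_I(\lambda, \mu, \mu) \ / \ (\lambda, \mu) \in (0, +\infty)^2 \right\} \cup \mathcal{M}_{\mathrm{similar}} $$

$$ \text{and} \quad \mathcal{M}_{\mathrm{interval}} := \bigcup_{1 \le k \le p-1} \left\{ M_I(\lambda, \mu, \mu) \ / \ (\lambda, \mu) \in (0, +\infty)^2,\ I = \{1, \dots, k\} \right\} \cup \mathcal{M}_{\mathrm{similar}} . $$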

Remark 12. The set $\mathcal{M}_{\mathrm{clus}}$ contains $2^p - 1$ models, a case we will denote by "clustering".
The other set, $\mathcal{M}_{\mathrm{interval}}$, only has $p$ models, and should take advantage of the knowledge of
the structure of Setting D. We call this setting "segmentation into intervals".

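Estimators. Concerning Experiments A-C, combining the two possible sets of matrices
with two penalization procedures (that is, with the penalty defined in Eq. (2.7) and either
$\Sigma$ known or estimated by $\hat\Sigma$) leads to four estimators defined by

$$ \forall \alpha \in \{\mathrm{similar}, \mathrm{ind}\},\ \forall S \in \{\Sigma, \hat\Sigma\}, \quad \hat f_{\alpha,S} := \hat f_{\hat M_{\alpha,S}} = A_{\hat M_{\alpha,S}}\, y , $$

$$ \text{where} \quad \hat M_{\alpha,S} \in \arg\min_{M \in \mathcal{M}_\alpha} \left\{ \frac{1}{np} \|y - \hat f_M\|_2^2 + \frac{2}{np} \operatorname{tr}\left( A_M \cdot (S \otimes I_n) \right) \right\} $$

and $\hat\Sigma$ is defined by Eq. (4.1). As detailed in Examples 1-2, $\hat f_{\mathrm{ind},\hat\Sigma}$ and $\hat f_{\mathrm{ind},\Sigma}$ are
concatenations of single-task estimators, whereas $\hat f_{\mathrm{similar},\hat\Sigma}$ and $\hat f_{\mathrm{similar},\Sigma}$ should take
advantage of a setting where the functions $f^j$ are close in $\mathcal{F}$ thanks to the regularization term
$\sum_{j,k} \|f^j - f^k\|_{\mathcal{F}}^2$.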



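Concerning Experiment D, given the two possible sets of matrices, plus the single-task
matrices, we consider the following three estimators:

$$ \forall \beta \in \{\mathrm{clus}, \mathrm{interval}, \mathrm{ind}\}, \quad \hat f_\beta := \hat f_{\hat M_\beta} = A_{\hat M_\beta}\, y , $$

$$ \text{where} \quad \hat M_\beta \in \arg\min_{M \in \mathcal{M}_\beta} \left\{ \frac{1}{np} \|y - \hat f_M\|_2^2 + \frac{2}{np} \operatorname{tr}\left( A_M \cdot (\hat\Sigma \otimes I_n) \right) \right\} . $$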



Remark 13 (Finding the jump in Algorithm 1). Algorithm 1 raises the question
of how to detect the jump of $\operatorname{df}(\lambda)$, which happens around
the variance we want to estimate. We chose to select an estimator $\hat C$ of $\sigma^2$ corresponding
to the smallest index such that $\operatorname{df}(\hat\lambda_0(\hat C)) < n/2$. Another approach was attempted,
namely choosing the index corresponding to the largest instantaneous jump of $\operatorname{df}(\hat\lambda_0(C))$
(which is piecewise constant and non-increasing). This approach had a major drawback,
because it sometimes selected a jump far away from the "real" jump, which consisted of
several small jumps. Both these approaches gave similar results in terms of prediction
error, and we chose the first one because of its direct link to our theoretical criterion given in
Algorithm 1.

Results. The results of the Experiments A-C are reported in Figures [THSl In each exper- 
iment, A^ = 1000 independent samples y E M"^ have been generated. Due to computation 
time, only $N = 100$ samples have been generated in simulations B. Expectations are
estimated thanks to empirical means over the $N$ samples, and error bars correspond to the
classical Gaussian 5% level difference test (that is, empirical variance over the $N$ samples
multiplied by $1.96/\sqrt{N}$). The results of Experiment D are reported in Table 1. Here
also the expectations are estimated thanks to empirical means over the 1000 samples. The
p-value corresponds to the classical Gaussian difference test, where the hypotheses tested
are of the shape $\mathcal{H}_0 = \{q < 1\}$ against the hypotheses $\mathcal{H}_1 = \{q > 1\}$, where the different
quantities $q$ are detailed later. We compute the p-value of the test, that is, the minimal risk
that leads the tests to reject the tested hypothesis.
Figure 1: Increasing the number of tasks $p$ (Setting A), quadratic errors of multi-task
estimators $(np)^{-1} \mathbb{E}[\|\hat f_{\mathrm{similar},S} - f\|^2]$. Blue: $S = \hat\Sigma$. Red: $S = \Sigma$.



Comments. As expected, multi-task learning significantly helps when all $f^j$ are equal, as soon as $p$ is large enough (Figure 3), especially for small $n$ (Figure 6) and large noise-levels (Figure 9). Increasing the number of tasks rapidly reduces the quadratic error with multi-task estimators (Figure 1), contrary to what happens with single-task estimators (Figure 2). A noticeable phenomenon also occurs in Figure 1 and even more in Figure 2: the estimator



Figure 2: Increasing the number of tasks $p$ (Setting A), quadratic errors of single-task estimators $(np)^{-1}\mathbb{E}[\|\hat f_{\mathrm{ind},S} - f\|_2^2]$. Blue: $S = \hat\Sigma$ (with the estimated $\Sigma$). Red: $S = \Sigma$ (with the true $\Sigma$). Horizontal axis: $p$.



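Quantity estimated: $q$ | Mean | Empirical variance | $H_0$ | p-value
$\mathbb{E}[\|\hat f_{\mathrm{clus}} - f\|^2 / \|\hat f_{\mathrm{ind}} - f\|^2]$ | 0.6683 | 0.2935 | $q > 1$ | $< 10^{-15}$
$\mathbb{E}[\|\hat f_{\mathrm{interval}} - f\|^2 / \|\hat f_{\mathrm{ind}} - f\|^2]$ | 0.6596 | 0.2704 | $q > 1$ | $< 10^{-15}$
$\mathbb{E}[\|\hat f_{\mathrm{interval}} - f\|^2 / \|\hat f_{\mathrm{clus}} - f\|^2]$ | 0.99998 | 0.1645 | $q > 1$ | 0.5006

Table 1: Clustering and segmentation (Setting D).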



$\hat f_{\mathrm{ind},\Sigma}$ (that is, obtained knowing the true covariance matrix $\Sigma$) is less efficient than $\hat f_{\mathrm{ind},\hat\Sigma}$, where the covariance matrix is estimated. It corresponds to the combination of two facts: (i) multiplying the ideal penalty by a small factor $1 < C_n < 1 + o(1)$ is known to often improve performances in practice when the sample size is small (see Section 6.3.2 of Arlot (2009)), and (ii) minimal penalty algorithms like Algorithm 1 are conjectured to overpenalize slightly when $n$ is small or the noise-level is large (Lerasle, 2011) (as confirmed by Figure 7). Interestingly, this phenomenon is stronger for single-task estimators (differences are smaller in Figure 1) and disappears when $n$ is large enough (Figure 5), which is consistent with the heuristic motivating multi-task learning: "increasing the number of tasks $p$ amounts to increasing the sample size". However, the advantage of the multi-task procedure (compared to the single-task one) seems to decrease when $p$ becomes very large, as seen in Figure 3. This seems reasonable since the multi-task procedure requires the estimation of the matrix $\Sigma$, that is, $p(p+1)/2$ parameters, and thus induces a large variance when $n$ is small (here, $p = 50$ and $n = 100$).

Figure 3: Increasing the number of tasks $p$ (Setting A), improvement of multi-task compared to single-task: $\mathbb{E}[\|\hat f_{\mathrm{similar},\hat\Sigma} - f\|^2 / \|\hat f_{\mathrm{ind},\hat\Sigma} - f\|^2]$.



Figures 4 and 5 show that our procedure works well with small $n$, and that increasing $n$ does not seem to significantly improve the performance of our estimators, except in the single-task setting with $\Sigma$ known, where the under-penalization phenomenon discussed above disappears.

Table 1 shows that using the multi-task procedure improves the estimation accuracy, both in the clustering setting and in the segmentation setting. The last line of Table 1 does not show that the clustering setting improves over the "segmentation into intervals" one, which was expected if both select a model close to the oracle, which is the same in both cases.

7. Conclusion and future work 

This paper shows that taking into account the unknown similarity between $p$ regression tasks can be done optimally (Theorem 10). The crucial point is to estimate the $p \times p$ covariance matrix $\Sigma$ of the noise (covariance between tasks). An estimator of $\Sigma$ is defined in Section 4, where non-asymptotic bounds on its error are provided under very mild assumptions on the mean of the sample (Theorem 8), which is probably the most important theoretical result of the paper.

Simulation experiments show that our algorithm works with reasonable sample sizes, and that our multi-task estimator often performs much better than its single-task counterpart. To the best of our knowledge, a theoretical proof of this point remains an open problem that we intend to investigate in future work.

Figure 4: Increasing the sample size $n$ (Setting B), quadratic errors of multi-task estimators $(np)^{-1}\mathbb{E}[\|\hat f_{\mathrm{similar},S} - f\|_2^2]$. Blue: $S = \hat\Sigma$. Red: $S = \Sigma$.



Theorem 10 only holds when the matrices in $\mathcal{M}$ can be diagonalized simultaneously (assumption (HM)), which often corresponds to cases where we have prior knowledge of what the relations between the tasks would be, and which is the only known case where the optimization is quite easy. We plan to extend our results to larger sets $\mathcal{M}$, which may require new concentration inequalities and new optimization algorithms.

We give in the Appendix the proofs of the different results stated in Sections 2, 4 and 5. The proofs of our main results are contained in Appendices D and E.

Appendix A. Proof of Proposition [2] 

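Proof It is sufficient to show that $\langle\cdot,\cdot\rangle_{\mathcal{G}}$ is positive-definite on $\mathcal{G}$. Take $g \in \mathcal{G}$ and $S = (S_{i,j})_{1\le i,j\le p}$ the symmetric positive-definite matrix of size $p$ verifying $S^2 = M$, and denote $T = S^{-1} = (T_{i,j})_{1\le i,j\le p}$. Let $f$ be the element of $\mathcal{G}$ defined by $\forall i \in \{1,\dots,p\}$, $g(\cdot,i) = \sum_{k=1}^p T_{i,k} f(\cdot,k)$. We then have:

\[
\begin{aligned}
\langle g, g\rangle_{\mathcal{G}}
&= \sum_{i=1}^p\sum_{j=1}^p\sum_{k=1}^p\sum_{l=1}^p M_{i,j}\,T_{i,k}\,T_{j,l}\,\langle f(\cdot,k), f(\cdot,l)\rangle_{\mathcal{F}} \\
&= \sum_{j=1}^p\sum_{k=1}^p\sum_{l=1}^p T_{j,l}\,\langle f(\cdot,k), f(\cdot,l)\rangle_{\mathcal{F}} \sum_{i=1}^p M_{j,i}\,T_{i,k} \\
&= \sum_{j=1}^p\sum_{k=1}^p\sum_{l=1}^p T_{j,l}\,\langle f(\cdot,k), f(\cdot,l)\rangle_{\mathcal{F}}\,(M\cdot T)_{j,k} \\
&= \sum_{k=1}^p\sum_{l=1}^p \langle f(\cdot,k), f(\cdot,l)\rangle_{\mathcal{F}} \sum_{j=1}^p T_{l,j}\,(M\cdot T)_{j,k} \\
&= \sum_{k=1}^p\sum_{l=1}^p \langle f(\cdot,k), f(\cdot,l)\rangle_{\mathcal{F}}\,(T\cdot M\cdot T)_{l,k} \\
&= \sum_{k=1}^p \|f(\cdot,k)\|_{\mathcal{F}}^2 .
\end{aligned}
\]

This shows that $\langle g, g\rangle_{\mathcal{G}} \ge 0$ and that

\[
\langle g, g\rangle_{\mathcal{G}} = 0 \;\Rightarrow\; f = 0 \;\Rightarrow\; g = 0 .
\]

Figure 5: Increasing the sample size $n$ (Setting B), quadratic errors of single-task estimators $(np)^{-1}\mathbb{E}[\|\hat f_{\mathrm{ind},S} - f\|_2^2]$. Blue: $S = \hat\Sigma$. Red: $S = \Sigma$.

Figure 6: Increasing the sample size $n$ (Setting B), improvement of multi-task compared to single-task: $\mathbb{E}[\|\hat f_{\mathrm{similar},\hat\Sigma} - f\|^2 / \|\hat f_{\mathrm{ind},\hat\Sigma} - f\|^2]$.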



Appendix B. Proof of Corollary [T] 

Proof If $(x, j) \in \mathcal{X} \times \{1,\dots,p\}$, the application $(f^1,\dots,f^p) \mapsto f^j(x)$ is clearly continuous. We now show that $(\mathcal{G}, \langle\cdot,\cdot\rangle_{\mathcal{G}})$ is complete. Let $(g_n)_{n\in\mathbb{N}}$ be a Cauchy sequence of $\mathcal{G}$ and define, as in Appendix A, the functions $f_n$ by $\forall n \in \mathbb{N}$, $\forall i \in \{1,\dots,p\}$, $g_n(\cdot,i) = \sum_{k=1}^p T_{i,k} f_n(\cdot,k)$. The same computations show that the $(f_n(\cdot,i))_{n\in\mathbb{N}}$ are Cauchy sequences of $\mathcal{F}$, and thus converge. So the sequence $(f_n)_{n\in\mathbb{N}}$ converges in $\mathcal{G}$, and $(g_n)_{n\in\mathbb{N}}$ does likewise. ■



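Figure 7: Increasing the signal-to-noise ratio (Setting C), quadratic errors of multi-task estimators $(np)^{-1}\mathbb{E}[\|\hat f_{\mathrm{similar},S} - f\|_2^2]$. Blue: $S = \hat\Sigma$. Red: $S = \Sigma$.

Appendix C. Proof of Proposition [4]
Proof We define

\[
\widetilde\Phi(x,j) = M^{-1}\begin{pmatrix}\delta_{1,j}\,\Phi(x)\\ \vdots\\ \delta_{p,j}\,\Phi(x)\end{pmatrix},
\]

with $\delta_{i,j} = \mathbf{1}_{i=j}$ being the Kronecker symbol, that is, $\delta_{i,j} = 1$ if $i = j$ and $0$ otherwise. We now show that $\widetilde\Phi$ is the feature function of the RKHS. For $g \in \mathcal{G}$ and $(x,l) \in \mathcal{X} \times \{1,\dots,p\}$, we have:

\[
\begin{aligned}
\langle g, \widetilde\Phi(x,l)\rangle_{\mathcal{G}}
&= \sum_{j=1}^p\sum_{i=1}^p\sum_{m=1}^p M_{j,i}\,(M^{-1})_{i,m}\,\delta_{m,l}\,\langle g(\cdot,j), \Phi(x)\rangle_{\mathcal{F}} \\
&= \sum_{j=1}^p\sum_{m=1}^p (M\cdot M^{-1})_{j,m}\,\delta_{m,l}\,g(x,j) \\
&= \sum_{j=1}^p \delta_{j,l}\,g(x,j) = g(x,l) .
\end{aligned}
\]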

Thus we can write:

Appendix D. Proof of Theorem 8
D.1 Some useful tools

We now give two properties of the Kronecker product, and then introduce a useful norm on $\mathcal{S}_p(\mathbb{R})$, upon which we give several properties. Those are the tools needed to prove Theorem 8.

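Property 1. The Kronecker product is bilinear, associative and, for all matrices $A, B, C, D$ such that the dimensions fit, $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$.

Property 2. Let $A \in \mathcal{M}_n(\mathbb{R})$; then $(A \otimes I_n)^\top = A^\top \otimes I_n$.

Definition 11. We now introduce the norm $|||\cdot|||$ on $\mathcal{S}_p(\mathbb{R})$, which is the modulus of the eigenvalue of largest magnitude, and can be defined by

\[
|||S||| := \sup_{z \in \mathbb{R}^p,\ \|z\|_2 = 1} |z^\top S z| .
\]

This norm has several interesting properties, some of which we will use and which are stated below.

Property 3. The norm $|||\cdot|||$ is a matricial norm: $\forall (A, B) \in \mathcal{S}_p(\mathbb{R})^2$, $|||AB||| \le |||A|||\,|||B|||$.

We will use the following result, which is a consequence of the preceding Property:

\[
\forall S \in \mathcal{S}_p(\mathbb{R}),\ \forall T \in \mathcal{S}_p^{++}(\mathbb{R}),\quad |||T^{-1/2} S T^{-1/2}||| \le |||S|||\,|||T^{-1}||| .
\]

We also have:
Proposition 12.

\[
\forall \Sigma \in \mathcal{S}_p(\mathbb{R}),\quad |||\Sigma \otimes I_n||| = |||\Sigma||| .
\]

Proof We can diagonalize $\Sigma$ in an orthonormal basis: $\exists U \in \mathcal{O}_p(\mathbb{R})$, $\exists D = \mathrm{Diag}(\mu_1,\dots,\mu_p)$, $\Sigma = U^\top D U$. We then have, using the properties of the Kronecker product:

\[
\Sigma \otimes I_n = (U^\top \otimes I_n)(D \otimes I_n)(U \otimes I_n) = (U \otimes I_n)^\top (D \otimes I_n)(U \otimes I_n) .
\]

We just have to notice that $U \otimes I_n \in \mathcal{O}_{np}(\mathbb{R})$ and that:

\[
D \otimes I_n = \mathrm{Diag}(\underbrace{\mu_1,\dots,\mu_1}_{n\ \text{times}},\dots,\underbrace{\mu_p,\dots,\mu_p}_{n\ \text{times}}) .
\]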
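The mixed-product property of the Kronecker product and the identity $|||\Sigma \otimes I_n||| = |||\Sigma|||$ of Proposition 12 can be checked numerically. The following is a small sanity check on random matrices (the sizes and the helper name `spectral_norm_sym` are arbitrary choices made for this illustration), not a proof.

```python
import numpy as np

def spectral_norm_sym(S):
    """Largest eigenvalue modulus of a symmetric matrix."""
    return float(np.max(np.abs(np.linalg.eigvalsh(S))))

rng = np.random.default_rng(0)
p, n = 3, 4

# Random symmetric matrix, not necessarily positive.
B = rng.standard_normal((p, p))
Sigma = (B + B.T) / 2.0
norm_sigma = spectral_norm_sym(Sigma)
norm_kron = spectral_norm_sym(np.kron(Sigma, np.eye(n)))

# Mixed-product property: (A kron B)(C kron D) = (AC) kron (BD).
A, C = rng.standard_normal((p, p)), rng.standard_normal((p, p))
Bn, D = rng.standard_normal((n, n)), rng.standard_normal((n, n))
lhs = np.kron(A, Bn) @ np.kron(C, D)
rhs = np.kron(A @ C, Bn @ D)
```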



This norm can also be written in other forms:

Property 4. If $M \in \mathcal{M}_n(\mathbb{R})$, the operator norm $\|M\|_2 := \sup_{t \in \mathbb{R}^n\setminus\{0\}} \|Mt\|_2 / \|t\|_2$ is equal to the greatest singular value of $M$: $\sqrt{\rho(M^\top M)}$. Henceforth, if $S$ is symmetric, we have $|||S||| = \|S\|_2$.

D.2 The proof

We now give a proof of Theorem 8, using Lemmas 13, 14 and 15, which are stated and proved in Section D.3. The outline of the proof is the following:

1. Apply Theorem 5 to problem $(P_z)$ for every $z \in \mathcal{Z}$ in order to

2. control $\|s - \zeta\|_\infty$ with a large probability, where $s, \zeta \in \mathbb{R}^{p(p+1)/2}$ are defined by

\[
\begin{aligned}
s &:= (\Sigma_{1,1},\dots,\Sigma_{p,p},\ \Sigma_{1,1}+\Sigma_{2,2}+2\Sigma_{1,2},\ \dots,\ \Sigma_{i,i}+\Sigma_{j,j}+2\Sigma_{i,j},\ \dots) \\
\text{and}\quad \zeta &:= (a(e_1),\dots,a(e_p),\ a(e_1+e_2),\dots,a(e_1+e_p),\ a(e_2+e_3),\dots,a(e_{p-1}+e_p)) .
\end{aligned}
\]

3. Deduce that $\hat\Sigma = J(\zeta)$ is close to $\Sigma = J(s)$ by controlling the Lipschitz norm of $J$.
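To make step 3 concrete, here is a minimal sketch of the map $J$, inferred from the requirement $J(s) = \Sigma$ with $s$ as above: diagonal entries are read directly, and off-diagonal entries are recovered by the polarization identity $\Sigma_{i,j} = (s_{\{i,j\}} - \Sigma_{i,i} - \Sigma_{j,j})/2$. The function names and the lexicographic ordering of the pairs $(i,j)$, $i<j$, are assumptions of this illustration.

```python
from itertools import combinations
import numpy as np

def s_vector(Sigma):
    """Stack the variances along e_i, then along e_i + e_j for i < j (lexicographic)."""
    p = Sigma.shape[0]
    diag = [Sigma[i, i] for i in range(p)]
    pairs = [Sigma[i, i] + Sigma[j, j] + 2.0 * Sigma[i, j]
             for i, j in combinations(range(p), 2)]
    return np.array(diag + pairs)

def J(zeta, p):
    """Rebuild a symmetric p x p matrix from a vector of length p(p+1)/2."""
    zeta = np.asarray(zeta, dtype=float)
    if zeta.size != p * (p + 1) // 2:
        raise ValueError("zeta must have length p(p+1)/2")
    mat = np.diag(zeta[:p])
    for idx, (i, j) in enumerate(combinations(range(p), 2)):
        off = (zeta[p + idx] - zeta[i] - zeta[j]) / 2.0
        mat[i, j] = mat[j, i] = off
    return mat

rng = np.random.default_rng(0)
p = 4
B = rng.standard_normal((p, p + 2))
Sigma = B @ B.T                      # random covariance matrix
Sigma_rebuilt = J(s_vector(Sigma), p)

# Vector with p ones followed by p(p-1)/2 fours: J gives the all-ones matrix.
zeta_1 = np.concatenate([np.ones(p), 4.0 * np.ones(p * (p - 1) // 2)])
J_ones = J(zeta_1, p)
```

In the proof, the entries of $\zeta$ are the variance estimates $a(z)$, so that any error on $\zeta$ propagates linearly to $\hat\Sigma = J(\zeta)$.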

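Proof 1. Apply Theorem 5: We start by noticing that Assumption (Hdf) actually holds true with all $\lambda_{0,j}$ equal. Indeed, let $(\lambda_{0,j})_{1\le j\le p}$ be given by Assumption (Hdf) and define $\lambda_0 := \min_{j=1,\dots,p} \lambda_{0,j}$. Then, $\lambda_0 \in (0,+\infty)$ and $\mathrm{df}(\lambda_0) \le \sqrt{n}$ since all $\lambda_{0,j}$ satisfy these two conditions. For the last condition, remark that for every $j \in \{1,\dots,p\}$, $\lambda_0 \le \lambda_{0,j}$ and $\lambda \mapsto \|(A_\lambda - I)F_{e_j}\|_2$ is a nondecreasing function (as noticed in Arlot and Bach (2011) for instance), so that

\[
\frac{1}{n}\|(B_{\lambda_0} - I_n)F_{e_j}\|_2^2 \le \frac{1}{n}\|(B_{\lambda_{0,j}} - I_n)F_{e_j}\|_2^2 \le \Sigma_{j,j}\sqrt{\frac{\ln(n)}{n}} . \tag{D.1}
\]

In particular, Eq. (3.2) holds with $d_n = 1$ for problem $(P_z)$ whatever $z \in \{e_1,\dots,e_p\}$.

Let us now consider the case $z = e_i + e_j$ with $i \ne j \in \{1,\dots,p\}$. Using Eq. (D.1) and that $F_{e_i+e_j} = F_{e_i} + F_{e_j}$, we have

\[
\|(B_{\lambda_0} - I_n)F_{e_i+e_j}\|_2^2 = \|(B_{\lambda_0} - I_n)F_{e_i}\|_2^2 + \|(B_{\lambda_0} - I_n)F_{e_j}\|_2^2 + 2\langle (B_{\lambda_0} - I_n)F_{e_i}, (B_{\lambda_0} - I_n)F_{e_j}\rangle ,
\]

where the first two terms are bounded by $\sqrt{n\ln(n)}\,(\Sigma_{i,i} + \Sigma_{j,j})$. The last term is bounded as follows:

\[
\begin{aligned}
2\langle (B_{\lambda_0} - I_n)F_{e_i}, (B_{\lambda_0} - I_n)F_{e_j}\rangle
&\le 2\,\|(B_{\lambda_0} - I_n)F_{e_i}\|_2 \cdot \|(B_{\lambda_0} - I_n)F_{e_j}\|_2 \\
&\le \sqrt{n\ln(n)}\,(\Sigma_{i,i} + \Sigma_{j,j}) \\
&\le (1 + c(\Sigma))\sqrt{n\ln(n)}\,(\Sigma_{i,i} + \Sigma_{j,j} + 2\Sigma_{i,j}) \\
&= (1 + c(\Sigma))\sqrt{n\ln(n)}\,\sigma^2_{e_i+e_j} ,
\end{aligned}
\]

because Lemma 13 shows

\[
2(\Sigma_{i,i} + \Sigma_{j,j}) \le (1 + c(\Sigma))(\Sigma_{i,i} + \Sigma_{j,j} + 2\Sigma_{i,j}) .
\]

Therefore, Eq. (3.2) holds with $d_n = 1 + c(\Sigma)$ for problem $(P_z)$ whatever $z \in \mathcal{Z}$.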
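2. Control $\|s - \zeta\|_\infty$: Let us define

\[
\eta_1 := \beta(\alpha + \delta)(1 + c(\Sigma))\sqrt{\frac{\ln(n)}{n}} .
\]

By Theorem 5, for every $z \in \mathcal{Z}$, an event $\Omega_z$ of probability greater than $1 - \kappa n^{-\delta}$ exists on which, if $n \ge n_0(\delta)$,

\[
(1 - \eta_1)\sigma_z^2 \le a(z) \le (1 + \eta_1)\sigma_z^2 .
\]

So, on $\Omega := \bigcap_{z\in\mathcal{Z}} \Omega_z$,

\[
\|\zeta - s\|_\infty \le \eta_1 \|s\|_\infty , \tag{D.2}
\]

and $\mathbb{P}(\Omega) \ge 1 - \kappa\, p(p+1)/2\; n^{-\delta}$ by the union bound. Let

\[
\|\Sigma\|_\infty := \sup_{i,j} |\Sigma_{i,j}| \quad\text{and}\quad C_1(p) := \sup_{\Sigma \in \mathcal{S}_p(\mathbb{R})} \left\{\frac{\|\Sigma\|_\infty}{|||\Sigma|||}\right\} .
\]

Since $\|s\|_\infty \le 4\|\Sigma\|_\infty$ and $C_1(p) = 1$ by Lemma 14, Eq. (D.2) implies that on $\Omega$,

\[
\|\zeta - s\|_\infty \le 4\eta_1 \|\Sigma\|_\infty \le 4\eta_1 |||\Sigma||| . \tag{D.3}
\]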

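3. Conclusion of the proof: Let

\[
C_2(p) := \sup_{\zeta \in \mathbb{R}^{p(p+1)/2}} \frac{|||J(\zeta)|||}{\|\zeta\|_\infty} .
\]

By Lemma 15, $C_2(p) \le \frac{3}{2}p$. By Eq. (D.3), on $\Omega$,

\[
|||\hat\Sigma - \Sigma||| = |||J(\zeta) - J(s)||| \le C_2(p)\,\|\zeta - s\|_\infty \le 4\eta_1 C_2(p)\,|||\Sigma||| . \tag{D.4}
\]

Since

\[
|||\Sigma^{-1/2}\hat\Sigma\Sigma^{-1/2} - I_p||| = |||\Sigma^{-1/2}(\hat\Sigma - \Sigma)\Sigma^{-1/2}||| \le |||\Sigma^{-1}|||\,|||\hat\Sigma - \Sigma||| ,
\]

and $|||\Sigma|||\,|||\Sigma^{-1}||| = c(\Sigma)$, Eq. (D.4) implies that on $\Omega$,

\[
|||\Sigma^{-1/2}\hat\Sigma\Sigma^{-1/2} - I_p||| \le 4\eta_1 C_2(p)\,|||\Sigma|||\,|||\Sigma^{-1}||| = 4\eta_1 C_2(p)\,c(\Sigma) \le 6\eta_1\, p\, c(\Sigma) .
\]

To conclude, Eq. (4.2) holds on $\Omega$ with

\[
\eta = 6 p\, c(\Sigma)\beta(\alpha + \delta)(1 + c(\Sigma))\sqrt{\frac{\ln(n)}{n}} \le L_1(\alpha + \delta)\,p\,\sqrt{\frac{\ln(n)}{n}}\; c(\Sigma)^2 \tag{D.5}
\]

for some numerical constant $L_1$.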


Remark 14. As stated in Arlot and Bach (2011), we need $\sqrt{n_0(\delta)/\ln(n_0(\delta))} \ge 504$ and $\sqrt{n_0(\delta)/\ln(n_0(\delta))} \ge 24(290 + \delta)$.

Remark 15. To ensure that the estimated matrix $\hat\Sigma$ is positive-definite we need that $\eta < 1$, that is,

\[
\sqrt{\frac{n}{\ln(n)}} > 6\beta(\alpha + \delta)\,p\,c(\Sigma)(1 + c(\Sigma)) .
\]
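To get a feeling for the sample size required by the condition $\eta < 1$ of Remark 15, one can solve $\sqrt{n/\ln(n)} > 6\beta(\alpha+\delta)\,p\,c(\Sigma)(1+c(\Sigma))$ numerically. The sketch below is only illustrative: the numerical values given to $\beta$, $\alpha$, $\delta$, $p$ and $c(\Sigma)$ are hypothetical, since the constants of Theorem 8 are not specified here, and the function names are ours.

```python
import math

def eta_threshold(beta, alpha, delta, p, cond):
    """Right-hand side of the condition sqrt(n / ln n) > 6 beta (alpha+delta) p c (1+c)."""
    return 6.0 * beta * (alpha + delta) * p * cond * (1.0 + cond)

def smallest_n(threshold):
    """Smallest integer n >= 3 with sqrt(n / ln n) > threshold (n / ln n increases for n >= 3)."""
    def ok(n):
        return math.sqrt(n / math.log(n)) > threshold
    lo, hi = 3, 4
    if ok(lo):
        return lo
    while not ok(hi):          # doubling search for an upper bracket
        lo, hi = hi, 2 * hi
    while hi - lo > 1:         # bisection: ok(hi) holds, ok(lo) fails
        mid = (lo + hi) // 2
        if ok(mid):
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical constants, chosen only to illustrate the order of magnitude.
threshold = eta_threshold(beta=1.0, alpha=2.0, delta=1.0, p=5, cond=2.0)
n_min = smallest_n(threshold)
```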




D.3 Useful Lemmas 

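Lemma 13. Let $p \ge 1$, $\Sigma \in \mathcal{S}_p^{++}(\mathbb{R})$ and $c(\Sigma)$ its condition number. Then,

\[
\forall\, 1 \le i < j \le p,\quad |\Sigma_{i,j}| \le \frac{c(\Sigma) - 1}{c(\Sigma) + 1}\cdot\frac{\Sigma_{i,i} + \Sigma_{j,j}}{2} . \tag{D.6}
\]

Remark 16. The proof of Lemma 13 shows that the constant $\frac{c(\Sigma)-1}{c(\Sigma)+1}$ cannot be improved without additional assumptions on $\Sigma$.

Proof It suffices to show the result when $p = 2$. Indeed, (D.6) only involves $2 \times 2$ submatrices $\widetilde\Sigma(i,j) \in \mathcal{S}_2^{++}(\mathbb{R})$ for which

\[
1 \le c(\widetilde\Sigma) \le c(\Sigma) \quad\text{hence}\quad 0 \le \frac{c(\widetilde\Sigma) - 1}{c(\widetilde\Sigma) + 1} \le \frac{c(\Sigma) - 1}{c(\Sigma) + 1} .
\]

So, some $\theta \in \mathbb{R}$ exists such that $\Sigma = |||\Sigma|||\, R_\theta^\top D R_\theta$ where

\[
R_\theta := \begin{pmatrix}\cos(\theta) & \sin(\theta)\\ -\sin(\theta) & \cos(\theta)\end{pmatrix},\quad D = \begin{pmatrix}1 & 0\\ 0 & \lambda\end{pmatrix}\quad\text{and}\quad \lambda := \frac{1}{c(\Sigma)} .
\]

Therefore,

\[
\Sigma = |||\Sigma|||\begin{pmatrix}\cos^2(\theta) + \lambda\sin^2(\theta) & \frac{1-\lambda}{2}\sin(2\theta)\\[2pt] \frac{1-\lambda}{2}\sin(2\theta) & \lambda\cos^2(\theta) + \sin^2(\theta)\end{pmatrix} .
\]

So, Eq. (D.6) is equivalent to

\[
\frac{(1-\lambda)\,|\sin(2\theta)|}{2} \le \frac{1-\lambda}{1+\lambda}\cdot\frac{1+\lambda}{2} ,
\]

which holds true for every $\theta \in \mathbb{R}$, with equality for $\theta = \pi/4$ (mod. $\pi/2$).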
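The inequality of Lemma 13, read here as $|\Sigma_{i,j}| \le \frac{c(\Sigma)-1}{c(\Sigma)+1}\cdot\frac{\Sigma_{i,i}+\Sigma_{j,j}}{2}$ for $i \ne j$ (our reading of (D.6), consistent with the way the lemma is used in Section D.2), can be checked numerically. The sketch below tests it on random positive-definite matrices and on the $2 \times 2$ rotation example of the proof with $\theta = \pi/4$, where the bound is attained. The helper name `lemma13_gap` is hypothetical.

```python
import numpy as np

def lemma13_gap(Sigma):
    """Smallest value of bound - |Sigma_ij| over pairs i < j (nonnegative if the lemma holds)."""
    eig = np.linalg.eigvalsh(Sigma)          # ascending eigenvalues
    cond = eig[-1] / eig[0]
    ratio = (cond - 1.0) / (cond + 1.0)
    p = Sigma.shape[0]
    gap = np.inf
    for i in range(p):
        for j in range(i + 1, p):
            bound = ratio * (Sigma[i, i] + Sigma[j, j]) / 2.0
            gap = min(gap, bound - abs(Sigma[i, j]))
    return gap

rng = np.random.default_rng(1)
gaps = []
for _ in range(200):
    B = rng.standard_normal((4, 6))
    gaps.append(lemma13_gap(B @ B.T))        # B B^T is positive-definite almost surely
min_gap = min(gaps)

# Extremal 2 x 2 case from the proof: rotation by pi/4, eigenvalues 1 and lambda.
theta, lam = np.pi / 4, 0.25
R = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])
tight_gap = lemma13_gap(R.T @ np.diag([1.0, lam]) @ R)
```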



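Lemma 14. For every $p \ge 1$, $C_1(p) := \sup_{\Sigma \in \mathcal{S}_p(\mathbb{R})} \frac{\|\Sigma\|_\infty}{|||\Sigma|||} = 1$.

Proof With $\Sigma = I_p$ we have $\|\Sigma\|_\infty = |||\Sigma||| = 1$, so $C_1(p) \ge 1$.

Let us introduce $(i, j)$ such that $|\Sigma_{i,j}| = \|\Sigma\|_\infty$. We then have, with $e_k$ being the $k$-th vector of the canonical basis of $\mathbb{R}^p$,

\[
|\Sigma_{i,j}| = |e_i^\top \Sigma e_j| \le |e_i^\top \Sigma e_i|^{1/2}\,|e_j^\top \Sigma e_j|^{1/2} \le \left(|||\Sigma|||^{1/2}\right)^2 .
\]

Lemma 15. For every $p \ge 1$, let $C_2(p) := \sup_{\zeta \in \mathbb{R}^{p(p+1)/2}} \frac{|||J(\zeta)|||}{\|\zeta\|_\infty}$. Then,

\[
\frac{p}{4} \le C_2(p) \le \frac{3}{2}\,p .
\]

Proof For the lower bound, we consider

\[
\zeta_1 = (\underbrace{1,\dots,1}_{p\ \text{times}},\ \underbrace{4,\dots,4}_{p(p-1)/2\ \text{times}}),\quad\text{then}\quad J(\zeta_1) = \begin{pmatrix}1 & \cdots & 1\\ \vdots & \ddots & \vdots\\ 1 & \cdots & 1\end{pmatrix},
\]

so that $|||J(\zeta_1)||| = p$ and $\|\zeta_1\|_\infty = 4$.

For the upper bound, we have for every $\zeta \in \mathbb{R}^{p(p+1)/2}$ and $z \in \mathbb{R}^p$ such that $\|z\|_2 = 1$

\[
|z^\top J(\zeta) z| = \Big|\sum_{1\le i,j\le p} z_i z_j J(\zeta)_{i,j}\Big| \le \sum_{1\le i,j\le p} |z_i|\,|z_j|\,|J(\zeta)_{i,j}| \le \|J(\zeta)\|_\infty\,\|z\|_1^2 .
\]

By definition of $J$, $\|J(\zeta)\|_\infty \le \frac{3}{2}\|\zeta\|_\infty$. Remarking that $\|z\|_1^2 \le p\,\|z\|_2^2$ yields the result.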



Appendix E. Proof of Theorem 10

The proof of Theorem 10 is similar to the proof of Theorem 3 in Arlot and Bach (2011). We give it here for completeness.

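E.1 Key quantities and their concentration around their means
Definition 16. We introduce, for $S \in \mathcal{S}_p^{+}(\mathbb{R})$,

\[
\widehat{M}_o(S) \in \operatorname*{argmin}_{M \in \mathcal{M}} \left\{ \|\hat f_M - y\|_2^2 + 2\operatorname{tr}\big(A_M \cdot (S \otimes I_n)\big) \right\} . \tag{E.1}
\]

Definition 17. Let $S \in \mathcal{S}_p(\mathbb{R})$; we denote by $S_+$ the symmetric matrix where the eigenvalues of $S$ have been thresholded at 0. That is, if $S = U^\top D U$, with $U \in \mathcal{O}_p(\mathbb{R})$ and $D = \mathrm{Diag}(d_1,\dots,d_p)$, then

\[
S_+ := U^\top \mathrm{Diag}\big(\max\{d_1, 0\},\dots,\max\{d_p, 0\}\big)\, U .
\]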
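The thresholding of Definition 17 is straightforward to compute from an eigendecomposition. The following sketch uses `numpy.linalg.eigh`, which returns $S = V\,\mathrm{Diag}(d)\,V^\top$, so that $V$ plays the role of $U^\top$; the function name `positive_part` is ours.

```python
import numpy as np

def positive_part(S):
    """Threshold the eigenvalues of a symmetric matrix at zero."""
    S = np.asarray(S, dtype=float)
    eigvals, eigvecs = np.linalg.eigh((S + S.T) / 2.0)   # symmetrize against rounding
    # Scale each eigenvector column by its thresholded eigenvalue, then recombine.
    return (eigvecs * np.maximum(eigvals, 0.0)) @ eigvecs.T

# Eigenvalues of S are 3 and -1; only the component along (1, 1)/sqrt(2) survives.
S = np.array([[1.0, 2.0],
              [2.0, 1.0]])
S_plus = positive_part(S)
```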

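Definition 18. For every $M \in \mathcal{M}$, we define

\[
\begin{aligned}
b(M) &= \|(A_M - I_{np})f\|_2^2 , \\
v_1(M) &= \mathbb{E}[\langle \varepsilon, A_M \varepsilon\rangle] = \operatorname{tr}(A_M \cdot (\Sigma \otimes I_n)) , \\
\delta_1(M) &= \langle \varepsilon, A_M \varepsilon\rangle - \mathbb{E}[\langle \varepsilon, A_M \varepsilon\rangle] = \langle \varepsilon, A_M \varepsilon\rangle - \operatorname{tr}(A_M \cdot (\Sigma \otimes I_n)) , \\
v_2(M) &= \mathbb{E}[\|A_M \varepsilon\|_2^2] = \operatorname{tr}(A_M^\top A_M \cdot (\Sigma \otimes I_n)) , \\
\delta_2(M) &= \|A_M \varepsilon\|_2^2 - \mathbb{E}[\|A_M \varepsilon\|_2^2] = \|A_M \varepsilon\|_2^2 - \operatorname{tr}(A_M^\top A_M \cdot (\Sigma \otimes I_n)) , \\
\delta_3(M) &= 2\langle A_M \varepsilon, (A_M - I_{np})f\rangle , \\
\delta_4(M) &= 2\langle \varepsilon, (I_{np} - A_M)f\rangle , \\
\Delta(M) &= -2\delta_1(M) + \delta_4(M) .
\end{aligned}
\]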
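The trace formulas for $v_1(M)$ and $v_2(M)$ can be illustrated by a quick Monte Carlo experiment. The sketch below is not the paper's setting: the smoothing matrix `A` is a generic symmetric matrix with eigenvalues in $(0,1)$ built for the occasion, $\Sigma$ is an arbitrary $2 \times 2$ covariance, and the noise is Gaussian with covariance $\Sigma \otimes I_n$ (observations of the same task stored contiguously).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 3, 2
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])

# Hypothetical smoother: A = K (K + I)^{-1} with K symmetric positive semi-definite.
G = rng.standard_normal((n * p, n * p))
K = G @ G.T
A = K @ np.linalg.inv(K + np.eye(n * p))

cov = np.kron(Sigma, np.eye(n))               # covariance of the noise vector
v1 = np.trace(A @ cov)                        # E[<eps, A eps>]
v2 = np.trace(A.T @ A @ cov)                  # E[||A eps||^2]

# Monte Carlo estimates with eps = L z, where L L^T = cov and z is standard normal.
L = np.linalg.cholesky(cov)
eps = rng.standard_normal((200_000, n * p)) @ L.T
v1_mc = np.mean(np.einsum("ki,ij,kj->k", eps, A, eps))
v2_mc = np.mean(np.sum((eps @ A.T) ** 2, axis=1))
```

The differences `v1_mc - v1` and `v2_mc - v2` are empirical averages of the centered terms $\delta_1$ and $\delta_2$, whose uniform concentration over $\mathcal{M}$ is the object of this section.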
Definition 19. Let $C_A, C_B, C_C, C_D, C_E, C_F$ be fixed nonnegative constants. For every $x \ge 0$
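we define the event

\[
\Omega_x = \Omega_x(\mathcal{M}, C_A, C_B, C_C, C_D, C_E, C_F)
\]

on which, for every $M \in \mathcal{M}$ and $\theta_1, \theta_2, \theta_3, \theta_4 \in (0, 1]$:

\[
\begin{aligned}
|\delta_1(M)| &\le \theta_1 \operatorname{tr}\big(A_M^\top A_M \cdot (\Sigma \otimes I_n)\big) + (C_A + C_B\theta_1^{-1})\,x\,|||\Sigma||| && \text{(E.2)} \\
|\delta_2(M)| &\le \theta_2 \operatorname{tr}\big(A_M^\top A_M \cdot (\Sigma \otimes I_n)\big) + (C_C + C_D\theta_2^{-1})\,x\,|||\Sigma||| && \text{(E.3)} \\
|\delta_3(M)| &\le \theta_3 \|(I_{np} - A_M)f\|_2^2 + C_E\theta_3^{-1}\,x\,|||\Sigma||| && \text{(E.4)} \\
|\delta_4(M)| &\le \theta_4 \|(I_{np} - A_M)f\|_2^2 + C_F\theta_4^{-1}\,x\,|||\Sigma||| && \text{(E.5)}
\end{aligned}
\]

Of key interest is the concentration of the empirical processes $\delta_i$, uniformly over $M \in \mathcal{M}$. The following Lemma introduces such a result, when $\mathcal{M}$ contains symmetric matrices parametrized by their eigenvalues (with fixed eigenvectors).

Lemma 20. Let $P \in \mathcal{O}_p(\mathbb{R})$, and suppose that (HM) holds. Then $\mathbb{P}(\Omega_x(\mathcal{M}, C_A, C_B, C_C, C_D, C_E, C_F)) \ge 1 - p\,e^{1027+\ln(n)}e^{-x}$ if

\[
C_A = 2,\ C_B = 1,\ C_C = 2,\ C_D = 1,\ C_E = 306.25,\ C_F = 306.25 .
\]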



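Proof We can write

\[
A_M = A_{d_1,\dots,d_p} = (P \otimes I_n)^\top \left[(D^{-1} \otimes K)(D^{-1} \otimes K + np\,I_{np})^{-1}\right](P \otimes I_n) = Q^\top \widetilde{A}_{d_1,\dots,d_p}\, Q ,
\]

with $Q = P \otimes I_n$ and $\widetilde{A}_{d_1,\dots,d_p} = (D^{-1} \otimes K)(D^{-1} \otimes K + np\,I_{np})^{-1}$. Remark that $\widetilde{A}_{d_1,\dots,d_p}$ is block-diagonal, with diagonal blocks being $B_{d_1},\dots,B_{d_p}$ using the notations of Section 3.

With $\widetilde\varepsilon = Q\varepsilon = (\widetilde\varepsilon_1^\top,\dots,\widetilde\varepsilon_p^\top)^\top$ and $\widetilde f = Qf = (\widetilde f_1^\top,\dots,\widetilde f_p^\top)^\top$ we can write

\[
\begin{aligned}
|\delta_1(M)| &= \left|\langle \widetilde\varepsilon, \widetilde{A}_{d_1,\dots,d_p}\widetilde\varepsilon\rangle - \mathbb{E}\big[\langle \widetilde\varepsilon, \widetilde{A}_{d_1,\dots,d_p}\widetilde\varepsilon\rangle\big]\right| , \\
|\delta_2(M)| &= \left|\|\widetilde{A}_{d_1,\dots,d_p}\widetilde\varepsilon\|_2^2 - \mathbb{E}\big[\|\widetilde{A}_{d_1,\dots,d_p}\widetilde\varepsilon\|_2^2\big]\right| , \\
|\delta_3(M)| &= 2\left|\langle \widetilde{A}_{d_1,\dots,d_p}\widetilde\varepsilon, (\widetilde{A}_{d_1,\dots,d_p} - I_{np})\widetilde f\rangle\right| , \\
|\delta_4(M)| &= 2\left|\langle \widetilde\varepsilon, (I_{np} - \widetilde{A}_{d_1,\dots,d_p})\widetilde f\rangle\right| .
\end{aligned}
\]

We can see that the quantities $\delta_i$ decouple, therefore

\[
\begin{aligned}
|\delta_1(M)| &= \Big|\sum_{i=1}^p \langle \widetilde\varepsilon_i, B_{d_i}\widetilde\varepsilon_i\rangle - \mathbb{E}\big[\langle \widetilde\varepsilon_i, B_{d_i}\widetilde\varepsilon_i\rangle\big]\Big| , \\
|\delta_2(M)| &= \Big|\sum_{i=1}^p \|B_{d_i}\widetilde\varepsilon_i\|_2^2 - \mathbb{E}\big[\|B_{d_i}\widetilde\varepsilon_i\|_2^2\big]\Big| , \\
|\delta_3(M)| &= \Big|\sum_{i=1}^p 2\langle B_{d_i}\widetilde\varepsilon_i, (B_{d_i} - I_n)\widetilde f_i\rangle\Big| , \\
|\delta_4(M)| &= \Big|\sum_{i=1}^p 2\langle \widetilde\varepsilon_i, (I_n - B_{d_i})\widetilde f_i\rangle\Big| .
\end{aligned}
\]

Using Lemma 9 of Arlot and Bach (2011), where we have $p$ concentration results on the sets $\widetilde\Omega_i$, each of probability at least $1 - e^{1027+\ln(n)}e^{-x}$, we can state that, on the set $\bigcap_{i=1}^p \widetilde\Omega_i$, we have

\[
\begin{aligned}
|\delta_1(M)| &\le \sum_{i=1}^p \theta_1 \operatorname{Var}[\widetilde\varepsilon_i]\operatorname{tr}(B_{d_i}^\top B_{d_i}) + (C_A + C_B\theta_1^{-1})\,x\operatorname{Var}[\widetilde\varepsilon_i] , \\
|\delta_2(M)| &\le \sum_{i=1}^p \theta_2 \operatorname{Var}[\widetilde\varepsilon_i]\operatorname{tr}(B_{d_i}^\top B_{d_i}) + (C_C + C_D\theta_2^{-1})\,x\operatorname{Var}[\widetilde\varepsilon_i] , \\
|\delta_3(M)| &\le \sum_{i=1}^p \theta_3 \|(I_n - B_{d_i})\widetilde f_i\|_2^2 + C_E\theta_3^{-1}\,x\operatorname{Var}[\widetilde\varepsilon_i] , \\
|\delta_4(M)| &\le \sum_{i=1}^p \theta_4 \|(I_n - B_{d_i})\widetilde f_i\|_2^2 + C_F\theta_4^{-1}\,x\operatorname{Var}[\widetilde\varepsilon_i] .
\end{aligned}
\]

To conclude, it suffices to see that for every $i \in \{1,\dots,p\}$, $\operatorname{Var}[\widetilde\varepsilon_i] \le |||\Sigma|||$.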



E.2 Intermediate result 

We first prove a general oracle inequality, under the assumption that we use inside the 
penalty an estimation of S which does not underestimate S too much. 

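Proposition 21. Let $C_A, C_B, C_C, C_D, C_E \ge 0$ be fixed constants, $\gamma > 0$, $\theta_S \in [0, 1/4)$ and $K_S \ge 0$. On $\Omega_{\gamma\ln(n)}(\mathcal{M}, C_A, C_B, C_C, C_D, C_E)$, for every $S \in \mathcal{S}_p^{++}(\mathbb{R})$ such that

\[
S \succeq \Sigma\left(1 - \theta_S \inf_{M \in \mathcal{M}}\left\{\frac{b(M) + v_2(M) + K_S\ln(n)\,|||\Sigma|||}{v_1(M)}\right\}\right) \tag{E.6}
\]

and for every $\theta \in (0, (1 - 4\theta_S)/2)$, we have: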
1

np 
+ 



fhloiS) - f 

1 

1-26- AOs 



< 



1 + 26 



inf <^ — 



1 -29 - 46s MeM { np 



Fm-F 



2 , 2tr(^M-((5-S)+®/„)) 



+ 



{2Ca + 3Cc + 6Cd + 6Ce + \{Cb + Cf))i + ^^ 



np 

ln(n)|||S|| 
np 



(E.7) 



Proof The proof of Proposition 21 is very similar to the one of Proposition 5 in Arlot and Bach (2011). First, we have


/m - / 
Im -y 



b{M) + V2iM) + 62{M) + 63iM) , 

II/m - /Hi - 2vi{M) - 26i{M) + 6i{M) + \\e\\l 



Combining Eq. (lE.ip and (lE.Op . we get: 

2 



A/o{S) 



< inf 



/ ^ + 2tr (^Aj7_^(^) • {{S - S)+ /„)j + A(A4(5)) 
/m - / ' + 2 tr {Am ■ {{S - S) » /„)) + A(M) 



(E.8) 
(E.9) 

(E.IO) 



On the event il 



7ln(n)5 



for every 9 e (0,1] and M e M, using Eq. (jK2]) and (jK5]) with 



'1 



1 



|A(M)| <6{b{M) + V2{M)) + {CA + -^iCB + CF)hHn)mi ■ 
Using Eq. (JETS]) and (JKij) with 62 = 63 = 1/2 we get that for every M G A^ Eq. 

Fm-F 
which is equivalent to 



(E.ll) 



^ > ^{b{M) + V2{M)) - (Cc + 2Cz) + 2CE)7ln(n)|||S||| , 



b{M) + V2iM) < 2 



Fm - F 



+ 2{Cc + 2CD + 2CEhln{n)\m\ 



(E.12) 



Combining Eq. ()KTT]) and ()KT2]) . we get 

2 



|A(M)| < 261 



Fm - F 



+ {Ca + {2Cc + 4Cd + 4Ce)6 + {Cb + Cp)^ 7ln(n)|||S||| . 



With Eq. (lElOl) . and with Ci=Ca,C2 = 2Cc + 4Cd + 4Ce and C3 = Cb + Cp we get 



:i - 29) 



/: 



Mo{S) 



f 



2 + 2tr(^M„(5) 



•((5-S)+®/„) < 



inf 

A/G7M 



/a/ - / 



+ 2tr {Am ■ {{S - S) ^ /„)) ^ + Ci + Ca^ + ^ 7ln(n)|||S 



C3 



(E.13) 



Using Eq. ()E.6P we can state that 



tr L4 



{{S - S) » /„) > -^5 {b{Mo{S)) + V2{Mo{S)) + Ks Hn)\\S\\ 



Mo(S) 

which then leads to Eq. (|E?7ll using Eq. ([El^ and (|KT3]1 . 
E.3 The proof itself

We now show Theorem 10 as a consequence of Proposition 21. It actually suffices to show that $\hat\Sigma$ does not underestimate $\Sigma$ too much, and that the second term in the infimum of Eq. (E.7) is negligible compared to the quadratic error $(np)^{-1}\|\hat f_M - f\|_2^2$.
Proof On the event $\Omega$ introduced in Theorem 8, Eq. (4.2) holds. Let

\[
\gamma = c(\Sigma)(1 + c(\Sigma)) .
\]

By Lemma 22 below, we have:

\[
\inf_{M \in \mathcal{M}}\left\{\frac{b(M) + v_2(M) + K_S\ln(n)\,|||\Sigma|||}{v_1(M)}\right\} \ge 2\sqrt{\frac{K_S\ln(n)\,|||\Sigma|||}{n\operatorname{tr}(\Sigma)}} .
\]

In order to have $\widehat{M}_o(\hat\Sigma)$ satisfying Eq. (E.6), it suffices to have, for every $\theta_S > 0$,



20s./^^:";'"l^'=6/?(« +%.,/-<") 



ntr(S) 



n 



which leads to the choice 



Ks 



3/3(a + (5)7tr(S) 



We now take $\theta_S = \theta = (9\ln(n))^{-1}$. Using Eq. (E.7) and requiring that $\ln(n) \ge 6$, we get on $\widetilde\Omega = \Omega \cap \Omega_{(\alpha+\delta)\ln(n)}(\mathcal{M}, C_A, C_B, C_C, C_D, C_E, C_F)$:



1 
np 



Jm J 



2 \ ln(n)/ MeM np 

-1 



Im - f 



+ 



2tr(AM-((S-S)+0/„: 



np 



1- 



31n(n 



-) \2Ca + 3Cc + 6Cd + 6Ce + ln(n) (i8Cb + ISCp + ^^^^ .^nlnf ^^^ 

X {a + 5) 



,2Hnnn\ 



Using Eq. (jD.Sp and defining 

m := 12/3(a + 5)p\/^c{J:) (1 + c(S)) 
V n 

we get 



1 



31n(n^ 



-1 



^(1 + r^) i^f I- 

2 \ mln) J M€M \np 



Im - f 



+ m 



np 

(E.14) 

tr(^M • (S ^ /„)) 
np 



2Ca + 3Cc + QCd + 6Ce + ln(n) 18Cb + ISCp + ^ .in^m, ^ ' 



x{a + 6) 



2 ln(n)^|||E||| 
np 
(E.15) 
Now, to get a classical oracle inequality, we have to show that $\eta_2 v_1(M) = \eta_2\operatorname{tr}(A_M \cdot (\Sigma \otimes I_n))$ is negligible compared to $\|\hat f_M - f\|_2^2$. Lemma 22 ensures that:



\fM eM,yx>0, 2. 



I X Ills III 
ntr(S) 



vi{M) <t;2(M)+x|||S|| 



With < C„ < 1, taking x to be equal to 72/3V ln(n)c(S)2(l + c(S))2 tr(S)/(C„|||S|||) leads 

to 

72/3V ln(n)c(S)2(l + c(S))2 tr(S) 



mVliM) < 2CnV2{M) + 



Gri 



(E.16) 



Then, since V2{M) < V2{M) + b{M) and using also Eq. (lE.Sp . we get 



V2{M) < 



Im - f 



+ \62{m)\ + \63{M)\ 



On i7 we have that for every 9 G (0, 1), using Eq. (jE.SP and ()E.4p . 



|(52(M)| + |53(M)| < 2^ 
which leads to 

MM)\ + \6s{M)\< 



Im - f 



62{M)\ - MM)\]+{Cc+{CD+CE)e-^){a+6)Hn)\m 



29 
1 + 29 



Im - f 



+ 



Cc + {Cd + Ce)9-' 



1 + 29 



(a + 5)ln(n)|||S|| 



Now, combining this equation with Eq. (jE.lGp . we get 



ri2Vi{M) < 1 + 



1 + 26 



Im - f 



+ 20, 



Cc + {Cd + Ce)9-' 



(a + 5)ln(n)|||S|| 



+ 



1 + 29 
72/3V ln(n)c(S)^(l + c(S))^ tr(£) 

On 



Taking 9 = 1/2 then leads to 



mviiM) < {1 + Cn) 



Im - f 



+ Cn{Cc + 2{Cd + CE)){a + 5) ln(?i)|||S|| 



+ 



72/3V ln(n)c(S)^(l + c(S))^ tr(£) 

Or). 



We now take $C_n = 1/\ln(n)$. We now replace the constants $C_A, C_B, C_C, C_D, C_E, C_F$ by their values in Lemma 20 and, if we require that $2\ln(n) \ge 1027$, we get, for some constant $L_3$,



1 



31n(?i) 



-1 



, , , / 729/32pVtr(S)2\ / 1 

185L5 + In n 5530.5 + f,„ '„,^ ^ ^ + 616.5 1 + 

V 4 S P 



72/3 V ln(n)c(£)^(l + c{^)f tr(S) 

On. 



ln(n)y ln(n 
<L3ln(n)/c(S)^^ 

infill 

(E.17) 
From this we can deduce Eq. (5.2).

Finally we deduce an oracle inequality in expectation by noting that if $n^{-1}\|\hat f_{\widehat M} - f\|_2^2 \le R_{n,\delta}$ on $\widetilde\Omega$, using the Cauchy-Schwarz inequality



E 



1 
np 



J M J 



E 



np 



Jm J 



+ E 



np 



J A.r J 



np 



<E[i?„,,] + ^J^^^(^^±il±i^WE 



n" 



J i\r J 



(E.18) 



We can remark that, since |||AAf ||| < 1, 

2 



fhi - f 



< 2 PMe|l2 + 2 Wilnp - ^A/)/|l2 < 2 ||ey + 8 \\f\\i 



So 



E 



J h.r J 



' M 



< 12 (np|||S||| +4 



together with Eq. (jE.lSp and Eq. (jE.lSp . induces Eq. (|5.3p . using that for some constant 
L4 >0, 



uJ '''^+}>^'' (m + ^iffi)<L. " 



n" 



np 



n 



(5/2 



I sill + 



1 
np 



We can finally define the constant L2 by: 



L3c(l;) tr(l;)(a + d) h ^4— f7^|||S||| < L2c(l;) tr(l;)(a + o) 



np 



np 



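Lemma 22. Let $n, p \ge 1$ be two integers, $x \ge 0$ and $\Sigma \in \mathcal{S}_p^{++}(\mathbb{R})$. Then,

\[
\inf_{A \in \mathcal{M}_{np}(\mathbb{R}),\ |||A||| \le 1}\left\{\frac{\operatorname{tr}(A^\top A \cdot (\Sigma \otimes I_n)) + x\,|||\Sigma|||}{\operatorname{tr}(A \cdot (\Sigma \otimes I_n))}\right\} \ge 2\sqrt{\frac{x\,|||\Sigma|||}{n\operatorname{tr}(\Sigma)}} .
\]

Proof First note that the bilinear form on $\mathcal{M}_{np}(\mathbb{R})$, $(A, B) \mapsto \operatorname{tr}(A^\top B \cdot (\Sigma \otimes I_n))$ is a scalar product. By the Cauchy-Schwarz inequality, for every $A \in \mathcal{M}_{np}(\mathbb{R})$,

\[
\operatorname{tr}(A \cdot (\Sigma \otimes I_n))^2 \le \operatorname{tr}(\Sigma \otimes I_n)\operatorname{tr}(A^\top A \cdot (\Sigma \otimes I_n)) .
\]

Thus, since $\operatorname{tr}(\Sigma \otimes I_n) = n\operatorname{tr}(\Sigma)$, if $c = \operatorname{tr}(A \cdot (\Sigma \otimes I_n)) > 0$,

\[
\operatorname{tr}(A^\top A \cdot (\Sigma \otimes I_n)) \ge \frac{c^2}{n\operatorname{tr}(\Sigma)} .
\]

Therefore

\[
\inf_{|||A||| \le 1}\left\{\frac{\operatorname{tr}(A^\top A \cdot (\Sigma \otimes I_n)) + x\,|||\Sigma|||}{\operatorname{tr}(A \cdot (\Sigma \otimes I_n))}\right\} \ge \inf_{c > 0}\left\{\frac{c}{n\operatorname{tr}(\Sigma)} + \frac{x\,|||\Sigma|||}{c}\right\}
\]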
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